Menze and Geiger propose a new benchmark for scene flow. This summary briefly covers the ground truth generation on KITTI. Their annotation process consists of two steps: First, the static background is recovered by removing moving objects and compensating for the camera’s egomotion. Second, moving objects are handled by fitting 3D CAD models to the corresponding point clouds. Note that this second step only works for rigidly moving objects (i.e., not for pedestrians etc.).
The static background is recovered by combining GPS/IMU data provided by KITTI and ICP fitting of the 3D point clouds. Seven scans are accumulated over time in a common coordinate system. All points belonging to moving objects are removed using KITTI’s 3D bounding boxes.
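To make the accumulation step concrete, here is a minimal sketch of merging several scans into one static point cloud. It is not the authors' code. It assumes each scan comes with a 4x4 sensor-to-common-frame pose (for example from GPS/IMU refined by ICP) and that each 3D bounding box is given in the scan's own frame as a center, a size and a yaw angle about the vertical axis. The function names and the box parameterization are hypothetical.

```python
import numpy as np

def to_common_frame(points, pose):
    """Map an (N, 3) scan into the common frame using a 4x4 pose matrix."""
    return points @ pose[:3, :3].T + pose[:3, 3]

def inside_box(points, center, size, yaw):
    """Boolean mask of points inside a box rotated by `yaw` about the z-axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Right-multiplying row vectors by rot applies rot.T, i.e. world -> box frame.
    local = (points - np.asarray(center)) @ rot
    return np.all(np.abs(local) <= np.asarray(size) / 2.0, axis=1)

def accumulate_static(scans, poses, boxes_per_scan):
    """Drop points inside any box, then stack all scans in the common frame."""
    kept = []
    for points, pose, boxes in zip(scans, poses, boxes_per_scan):
        moving = np.zeros(len(points), dtype=bool)
        for center, size, yaw in boxes:
            moving |= inside_box(points, center, size, yaw)
        kept.append(to_common_frame(points[~moving], pose))
    return np.vstack(kept)
```

Removing the points before transforming them keeps the box test in the frame in which the boxes are assumed to be annotated.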
For all rigidly moving objects (all vehicles), 3D CAD models are fitted. Menze and Geiger selected a set of 16 representative 3D CAD models from Google 3D Warehouse. A small set is justified because, given the noise in the observations, millimeter-accurate 3D CAD models are not required. The 3D CAD models are sampled to obtain roughly $3000$ points. Then, for each dynamic object, a 3D similarity transformation is estimated. The transformation includes scale and pose of the 3D CAD model in the first frame as well as the 3D rigid body motion. The CAD model is registered by minimizing an energy consisting of three terms: First, the average truncated $L_2$ distance between the 3D point cloud and its nearest neighbors in the CAD model. Second, the truncated $L_1$ distance between the disparity map induced by the CAD model and a Semi-Global Matching (SGM) map. Finally, the quadratic 2D error of the 2D-3D correspondences. Because the nearest neighbors depend on the current transformation, the minimization alternates between finding nearest neighbors and minimizing all three terms.
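The alternation can be illustrated with a strongly simplified, ICP-style sketch. It is an assumption-laden stand-in, not the authors' optimizer: it keeps only the first (3D point distance) term, ignores the disparity and correspondence terms as well as the rigid body motion, and replaces the general minimization with the closed-form similarity fit of Umeyama. Truncation is approximated by discarding matches farther than `trunc`, since those contribute a constant to a truncated cost. All function and parameter names are hypothetical.

```python
import numpy as np

def fit_similarity(src, dst):
    """Closed-form least-squares similarity (Umeyama): dst ~ s * R @ src + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # avoid returning a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((xs ** 2).sum() / len(src))
    t = mu_d - s * R @ mu_s
    return s, R, t

def register_model(model, observed, iters=30, trunc=1.0):
    """Alternate nearest-neighbor search and similarity fitting."""
    s, R, t = 1.0, np.eye(3), observed.mean(0) - model.mean(0)
    for _ in range(iters):
        moved = s * model @ R.T + t
        # Step 1: nearest transformed model point for every observed point.
        d2 = ((observed[:, None, :] - moved[None, :, :]) ** 2).sum(-1)
        nn = d2.argmin(1)
        keep = d2[np.arange(len(observed)), nn] <= trunc ** 2
        if keep.sum() < 3:
            break  # too few matches for a stable fit
        # Step 2: refit scale, rotation and translation to the kept matches.
        s, R, t = fit_similarity(model[nn[keep]], observed[keep])
    return s, R, t
```

The brute-force distance matrix is adequate for a few thousand sampled model points; a spatial index such as a k-d tree would be the usual choice for larger clouds.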