Jun Xie, Martin Kiefel, Ming-Ting Sun, Andreas Geiger. Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer. CVPR, 2016.

Xie et al. propose a CRF-based method for 3D to 2D label transfer in order to provide a large dataset of urban street scenes with semantic (instance) segmentation. As motivation, they point out the effort required for annotating images with dense semantic labels. This is reflected in Figure 1 showing a set of popular datasets ranked according to size and annotation time (per frame).

Figure 1: Illustration of what Xie et al. call the “Curse of dataset annotation”.

They start by annotating objects in a 3D point cloud that is obtained by combining several individual point clouds. The annotation uses shape primitives, e.g., cubes or ellipsoids, as illustrated by the high-level pipeline in Figure 2. The dataset was captured using a setup similar to KITTI [12] and provides roughly 400k images and 100k laser scans.
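To make the primitive-based annotation more concrete, here is a minimal sketch of how one might determine which annotated primitives contain a given 3D point. It assumes axis-aligned boxes and ellipsoids for simplicity, whereas the primitives in the paper may well be arbitrarily oriented. The names `Box`, `Ellipsoid` and `labels_for_point` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box primitive (simplifying assumption: no rotation)."""
    label: str
    center: tuple
    half_extent: tuple

    def contains(self, p):
        # Inside if the offset from the center is within the half extent on every axis.
        return all(abs(p[k] - self.center[k]) <= self.half_extent[k] for k in range(3))


@dataclass
class Ellipsoid:
    """Axis-aligned ellipsoid primitive (simplifying assumption: no rotation)."""
    label: str
    center: tuple
    radii: tuple

    def contains(self, p):
        # Inside if the normalized squared distance to the center is at most 1.
        return sum(((p[k] - self.center[k]) / self.radii[k]) ** 2 for k in range(3)) <= 1.0


def labels_for_point(p, primitives):
    """Return the set of labels of all primitives containing point p."""
    return {prim.label for prim in primitives if prim.contains(p)}


# Hypothetical scene: a tree ellipsoid overlapping a building box.
primitives = [
    Box("building", (0.0, 0.0, 0.0), (5.0, 5.0, 5.0)),
    Ellipsoid("tree", (4.0, 0.0, 0.0), (2.0, 2.0, 2.0)),
]
print(sorted(labels_for_point((4.5, 0.0, 0.0), primitives)))  # ['building', 'tree']
```

Returning a set rather than a single label reflects that primitives can overlap, so a 3D point may carry several candidate labels before inference resolves the ambiguity.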

Figure 2: Illustration of the 3D to 2D label transfer pipeline. This figure also illustrates the 3D annotation process.


To transfer the 3D annotations to 2D, they propose a CRF model of the form

$E(s) = \sum_{i \in \mathcal{P}} \phi_i^{\mathcal{P}}(s_i) + \sum_{l \in \mathcal{L}} \phi_l^{\mathcal{L}}(s_l) + \sum_{m \in \mathcal{F}, i \in \mathcal{P}} \phi_{mi}^{\mathcal{F}}(s_i)$

$+ \sum_{i,j \in \mathcal{P}} \psi_{ij}^{\mathcal{P},\mathcal{P}}(s_i, s_j) + \sum_{l,k \in \mathcal{L}} \psi_{lk}^{\mathcal{L},\mathcal{L}}(s_l, s_k) + \sum_{i \in \mathcal{P}, l \in \mathcal{L}} \psi_{il}^{\mathcal{P},\mathcal{L}}(s_i, s_l)$

where $\mathcal{P}$ denotes the set of pixels, $\mathcal{L}$ the set of sparse 3D points and $\mathcal{F}$ the set of 3D annotations; $s_i$ denotes the label of pixel (or 3D point) $i$. The model consists of unary and pairwise potentials. Without going into details, the unary potentials favor a labeling consistent with the 3D annotations. For example, the unary potential on 3D points penalizes labels that the 3D point is not annotated with (note that a 3D point may lie in multiple shape primitives!). The pairwise potentials are based on Gaussian edge kernels, favoring similar labels for 3D points or pixels that are close and similar in color or surface normal. The potentials contain weights that are trained by minimizing the logistic loss assuming a mean field approximation.
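As a rough illustration of the two kinds of potentials described above, the following sketch implements a unary cost for a 3D point and a Gaussian edge kernel with a Potts-style pairwise cost. This is not the paper's exact formulation: the function names, the Potts compatibility, the bandwidths `theta_pos` and `theta_feat`, and the constant `penalty` are all assumptions made for illustration, and the learned weights from the paper are not reproduced.

```python
import math


def point_unary(label, annotated_labels, penalty=1.0):
    """Unary cost for a 3D point: zero if the label is among the labels of the
    primitives containing the point, a constant penalty otherwise (assumed form)."""
    return 0.0 if label in annotated_labels else penalty


def gaussian_edge_weight(pos_i, pos_j, feat_i, feat_j, theta_pos=10.0, theta_feat=0.1):
    """Gaussian kernel that is large when two pixels (or 3D points) are close in
    position and similar in a feature such as color or surface normal."""
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    d_feat = sum((a - b) ** 2 for a, b in zip(feat_i, feat_j))
    return math.exp(-d_pos / (2 * theta_pos ** 2) - d_feat / (2 * theta_feat ** 2))


def pairwise_potential(label_i, label_j, weight, kernel_value):
    """Potts-style pairwise cost (assumed): differing labels are penalized in
    proportion to the kernel value, equal labels cost nothing."""
    return weight * kernel_value if label_i != label_j else 0.0


# Two nearby pixels with similar color yield a larger kernel value, and hence a
# larger cost for assigning them different labels, than two distant pixels.
near = gaussian_edge_weight((10, 10), (11, 10), (0.5, 0.5, 0.5), (0.5, 0.5, 0.52))
far = gaussian_edge_weight((10, 10), (60, 10), (0.5, 0.5, 0.5), (0.5, 0.5, 0.52))
print(near > far)  # True
print(pairwise_potential("car", "road", weight=2.0, kernel_value=near))
print(point_unary("car", {"car", "road"}), point_unary("tree", {"car", "road"}))
```

In a full model, the `weight` of each potential would be one of the trained parameters mentioned above, and inference would minimize the sum of all unary and pairwise terms rather than evaluate them in isolation.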


They evaluate the approach against several baselines, showing that the 3D to 2D label transfer is a good trade-off between annotation time and accuracy. Some qualitative results are shown in Figure 3.

Figure 3: Qualitative results showing the inferred annotation (top), the annotation of the 3D points projected onto the image plane (middle) and the error (bottom).

  • [12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 32(11):1231–1237, 2013.