Ziyu Zhang, Alexander G. Schwing, Sanja Fidler, Raquel Urtasun. Monocular Object Instance Segmentation and Depth Ordering with CNNs. ICCV, 2015.

Zhang et al. tackle instance-level segmentation of cars in urban street scenes, where the instance labels are interpreted as a depth ordering. At a high level, their approach consists of two parts. First, a convolutional neural network (architecture shown in Figure 1, based on VGG16 [35]) is applied to overlapping patches to predict a per-pixel depth level, with the number of levels restricted to six. Afterwards, an MRF merges the predictions from the different patches into a coherent prediction.


Figure 1: Illustration of the network architecture used for the first part of their approach. The network is based on a pre-trained VGG16 model with slight modifications in order to output a $40 \times 40$ map with $6$ channels corresponding to the $6$ depth levels to predict.


The VGG16 architecture is slightly adapted to produce a $40 \times 40$ pixel output map. For training, they use [2] to obtain ground truth car segmentation masks from the 3D bounding box annotations provided by KITTI [6]. The network is trained with a per-pixel cross-entropy loss.
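To make the loss concrete, here is a minimal sketch of a per-pixel cross-entropy over a small label map. It is not the authors' implementation: the function name `pixelwise_cross_entropy`, the nested-list layout and the toy $2 \times 2$ input are assumptions for illustration. Only the $6$ channels come from the summary above.

```python
import math

def pixelwise_cross_entropy(logits, labels):
    """Mean per-pixel cross-entropy.

    logits: H x W x C nested lists of raw scores (C = 6 depth levels here).
    labels: H x W nested lists of integer class indices in [0, C).
    """
    total, count = 0.0, 0
    for row_logits, row_labels in zip(logits, labels):
        for scores, target in zip(row_logits, row_labels):
            # Log-sum-exp with the maximum subtracted for numerical stability.
            m = max(scores)
            log_z = m + math.log(sum(math.exp(s - m) for s in scores))
            # Negative log-probability of the target class at this pixel.
            total += log_z - scores[target]
            count += 1
    return total / count

# Toy 2 x 2 map with 6 channels (hypothetical data, not from the paper).
uniform = [[[0.0] * 6 for _ in range(2)] for _ in range(2)]
labels = [[0, 1], [2, 3]]
loss = pixelwise_cross_entropy(uniform, labels)
```

With uniform scores every class has probability $1/6$, so the loss equals $\log 6$ regardless of the labels, which is a handy sanity check.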


The network predictions for the individual patches are merged as follows. A connected components algorithm is run on each patch and the predictions within each component are averaged. The overall prediction for a specific pixel is then given by the maximum of the predictions across all depth levels (6 per patch) and patches. The resulting map is fed into an MRF of the form

$E(y) = \sum_p (E_{\text{CNN},p}(y_p) + E_{\text{CCO},p}(y_p))$

$+\sum_{p,p':\mathcal{C}(p)\neq\mathcal{C}(p')} E_{\text{long},p,p'}(y_p, y_{p'})$

$+\sum_{p,p' \in \mathcal{N}(p)} E_{\text{short},p,p'}(y_p, y_{p'})$


Here, the first potential, $E_{\text{CNN}}$, favors predictions with values higher than the corresponding pixel value. This is motivated by the fact that the CNN prediction is always lower than the true one. The second potential, $E_{\text{CCO}}$, encourages instances to be ordered along the vertical axis, which usually reflects depth in traffic scenes. The third potential, $E_{\text{long}}$, encourages pixels to be assigned to different states/depths if they belong to different connected components, and the last potential, $E_{\text{short}}$, encourages nearby pixels to be assigned to the same state if this is also true in the CNN labeling. For inference, they resort to an $\alpha$-$\beta$-like algorithm where in each step quadratic pseudo-boolean optimization is used for the binary inference problem.
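The structure of the energy can be sketched in a few lines. The code below only mirrors the three sums of $E(y)$. The concrete potentials in the toy example are placeholders chosen to match the verbal description, not the definitions from the paper, and all names (`mrf_energy`, `e_cnn`, `e_cco`, `e_long`, `e_short`) are hypothetical.

```python
def mrf_energy(y, unary_terms, long_pairs, short_pairs, e_long, e_short):
    """Evaluate E(y) for a labeling y (dict: pixel -> state).

    unary_terms: functions (p, y_p) -> float, e.g. [E_CNN, E_CCO].
    long_pairs:  pixel pairs (p, q) lying in different connected components.
    short_pairs: pixel pairs (p, q) with q in the neighborhood of p.
    """
    energy = sum(term(p, y_p) for p, y_p in y.items() for term in unary_terms)
    energy += sum(e_long(p, q, y[p], y[q]) for p, q in long_pairs)
    energy += sum(e_short(p, q, y[p], y[q]) for p, q in short_pairs)
    return energy

# Toy setup (assumed): three pixels, pixels 0 and 1 form one component, pixel 2 another.
cnn = {0: 1, 1: 1, 2: 2}  # merged CNN depth level per pixel

# Placeholder potentials, NOT the paper's definitions:
e_cnn = lambda p, s: 0.0 if s >= cnn[p] else 1.0      # prefer states >= CNN value
e_cco = lambda p, s: 0.0                              # vertical-ordering term omitted
e_long = lambda p, q, a, b: 1.0 if a == b else 0.0    # different components, different states
e_short = lambda p, q, a, b: 1.0 if cnn[p] == cnn[q] and a != b else 0.0

long_pairs = [(0, 2), (1, 2)]
short_pairs = [(0, 1), (1, 2)]

good = mrf_energy({0: 1, 1: 1, 2: 2}, [e_cnn, e_cco], long_pairs, short_pairs, e_long, e_short)
bad = mrf_energy({0: 1, 1: 2, 2: 2}, [e_cnn, e_cco], long_pairs, short_pairs, e_long, e_short)
```

In this toy case the labeling that follows the CNN map has zero energy, while moving pixel 1 into the state of the other component is penalized once by the long-range term and once by the short-range term.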

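The patch-merging step that precedes the MRF can be sketched as well. This is one possible reading of the description, not the authors' code: it assumes each patch provides a single predicted level per pixel, that level $0$ means background, that components are 4-connected, and that overlapping patches are combined by a per-pixel maximum. The helper names `connected_components` and `merge_patches` are made up for this example.

```python
from collections import deque

def connected_components(mask):
    """Label 4-connected foreground regions of a 2D boolean grid (0 = background)."""
    h, w = len(mask), len(mask[0])
    comp = [[0] * w for _ in range(h)]
    n = 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or comp[y][x]:
                continue
            n += 1
            comp[y][x] = n
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not comp[ny][nx]:
                        comp[ny][nx] = n
                        queue.append((ny, nx))
    return comp, n

def merge_patches(height, width, patches):
    """patches: list of (top, left, level_map); level 0 is assumed to be background."""
    merged = [[0.0] * width for _ in range(height)]
    for top, left, levels in patches:
        comp, n = connected_components([[v > 0 for v in row] for row in levels])
        sums, counts = [0.0] * (n + 1), [0] * (n + 1)
        for y, row in enumerate(levels):
            for x, v in enumerate(row):
                sums[comp[y][x]] += v
                counts[comp[y][x]] += 1
        for y, row in enumerate(levels):
            for x, _ in enumerate(row):
                c = comp[y][x]
                if c:  # average within the component, then max across patches
                    avg = sums[c] / counts[c]
                    merged[top + y][left + x] = max(merged[top + y][left + x], avg)
    return merged

# Two overlapping 2 x 2 patches on a 2 x 4 image (hypothetical values).
patch_a = (0, 0, [[1, 3], [0, 0]])
patch_b = (0, 1, [[3, 3], [0, 0]])
merged = merge_patches(2, 4, [patch_a, patch_b])
```

In the toy example the first patch averages to level $2$ within its component, and the overlapping pixel ends up with the larger value $3$ contributed by the second patch.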

Without going into details regarding the evaluation, some results are shown in Figure 2.

Figure 2: Qualitative results showing the original image (left), the ground truth (middle) and the predictions (right).

  • [2] L.-C. Chen, S. Fidler, A. Yuille, and R. Urtasun. Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision. In CVPR, 2014.
  • [6] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In CVPR, 2012.
  • [35] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. http://arxiv.org/abs/1409.1556, 2014.