Jonas Uhrig, Marius Cordts, Uwe Franke, Thomas Brox. Pixel-Level Encoding and Depth Layering for Instance-Level Semantic Labeling. GCPR, 2016.

Uhrig et al. propose an extension of FCN-8s [21] that simultaneously tackles instance-level semantic segmentation and depth ordering. The approach is rather simple, yet it outperforms the state of the art on KITTI [10] and CityScapes [6].

Figure 1: High-level view of the proposed approach. The FCN-8s is used to predict the three pixel-level maps depicted in the second column. Template matching is then used to recover instances.

The overall approach is illustrated in Figure 1. Given an input image, the FCN predicts three per-pixel outputs: a semantic label, a depth label, and the direction to the center of the object the pixel belongs to. Template matching is applied to the predicted directions, and instances are then identified in the resulting template matching map by iteratively finding maxima. The ground truth used for training (starting from an ImageNet-initialized model) can be inferred from the instance-level ground truth provided by KITTI and CityScapes.
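To make two of these steps more concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes the instance ground truth is an integer map with 0 as background, takes the mean pixel position as the object center, and encodes the direction as a continuous unit vector (the paper's exact encoding may differ). The function names, the score threshold, and the square suppression window are hypothetical choices made for illustration only.

```python
import numpy as np


def direction_to_center(instance_ids):
    """Per-pixel unit vectors pointing to the centroid of each instance.

    instance_ids: (H, W) integer array, 0 = no instance (assumed convention).
    Returns an (H, W, 2) float array holding (dy, dx); zeros for background.
    """
    height, width = instance_ids.shape
    ys, xs = np.mgrid[0:height, 0:width]
    directions = np.zeros((height, width, 2), dtype=np.float64)
    for instance_id in np.unique(instance_ids):
        if instance_id == 0:
            continue  # skip background
        mask = instance_ids == instance_id
        # Assumption: the object center is the mean pixel position.
        center_y, center_x = ys[mask].mean(), xs[mask].mean()
        offsets = np.stack([center_y - ys[mask], center_x - xs[mask]], axis=-1)
        norms = np.linalg.norm(offsets, axis=-1, keepdims=True)
        # A pixel exactly at the center has zero offset and stays zero.
        directions[mask] = offsets / np.maximum(norms, 1e-12)
    return directions


def find_instance_centers(score_map, threshold=0.5, radius=2):
    """Greedily pick maxima, suppressing a square window around each pick."""
    scores = score_map.astype(np.float64).copy()
    centers = []
    while True:
        y, x = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[y, x] < threshold:
            break  # no sufficiently strong maximum left
        centers.append((int(y), int(x)))
        # Suppress the neighborhood so the next maximum is a different object.
        scores[max(0, y - radius):y + radius + 1,
               max(0, x - radius):x + radius + 1] = -np.inf
    return centers
```

Here `direction_to_center` illustrates how a direction target could be derived from instance-level annotations, and `find_instance_centers` illustrates the iterative maximum search on a score map such as the one produced by template matching; the template matching step itself is not shown.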

Figure 2: Qualitative results on CityScapes showing the input image as well as ground truth and predicted instances. The last two rows additionally show the three predicted channels for the third image.

Qualitative results on CityScapes are shown in Figure 2.

  • [6] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes Dataset for semantic urban scene understanding. In: CVPR (2016).
  • [10] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR (2012).
  • [21] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015).