Uhrig et al. propose an extension of FCN-8s to simultaneously tackle instance-level semantic segmentation and depth ordering. The approach is rather simple, yet it still outperforms the state of the art on KITTI and CityScapes.
Figure 1: High-level view of the approach. FCN-8s predicts the three pixel-level maps depicted in the second column. Template matching is then used to recover instances.
The overall approach is illustrated in Figure 1. Given an input image, the FCN predicts three pixel-level maps: semantic labels, a depth label, and the direction to the center of the object the pixel belongs to. The direction map is used to recover instances via template matching: instances are identified in the resulting template matching map by iteratively finding maxima. The ground truth used for training (on an ImageNet-initialized model) can be inferred from the instance-level ground truth provided by KITTI and CityScapes.
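To make the last point concrete, the following is a minimal sketch of how a direction-to-center target could be derived from an instance ID map. It is not the authors' code. The function name `direction_targets`, the use of the pixel centroid as the object center, the quantization into `num_bins` angular classes, and the `-1` ignore value are all assumptions made for illustration.

```python
import math


def direction_targets(instance_ids, num_bins=8, background=0):
    """Quantized direction from each pixel to its instance centroid.

    instance_ids: 2D list of ints, one instance id per pixel.
    Returns a 2D list of the same shape. Values are bin indices in
    [0, num_bins); -1 marks background pixels and a pixel that lies
    exactly on its centroid (no direction defined there).
    Image coordinates are used: x grows to the right, y grows downward,
    so bin 0 means "center is to the right" and, with 8 bins, bin 2
    means "center is below".
    """
    # Pass 1: accumulate coordinate sums per instance to get centroids.
    sums = {}
    for y, row in enumerate(instance_ids):
        for x, inst in enumerate(row):
            if inst == background:
                continue
            sy, sx, n = sums.get(inst, (0, 0, 0))
            sums[inst] = (sy + y, sx + x, n + 1)
    centers = {inst: (sy / n, sx / n) for inst, (sy, sx, n) in sums.items()}

    # Pass 2: angle from each pixel to its centroid, rounded to the nearest bin.
    bin_width = 2 * math.pi / num_bins
    targets = []
    for y, row in enumerate(instance_ids):
        out_row = []
        for x, inst in enumerate(row):
            if inst == background:
                out_row.append(-1)
                continue
            cy, cx = centers[inst]
            dy, dx = cy - y, cx - x
            if dy == 0 and dx == 0:
                out_row.append(-1)
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            out_row.append(round(angle / bin_width) % num_bins)
        targets.append(out_row)
    return targets


if __name__ == "__main__":
    # Hypothetical toy input: one 3x3 instance on a 5x5 background.
    toy = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in direction_targets(toy):
        print(row)
```

Semantic targets follow directly from the class of each instance, so only the direction map needs this kind of geometric preprocessing. How depth labels are derived depends on the depth information available for each dataset and is not sketched here.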
Figure 2: Qualitative results on CityScapes showing the input image as well as ground truth and predicted instances. The last two rows additionally show the three predicted channels for the third image.
Qualitative results on CityScapes are shown in Figure 2.