Jifeng Dai, Kaiming He, Jian Sun. Instance-Aware Semantic Segmentation via Multi-task Network Cascades. CVPR. 2016.

Dai et al. propose Multi-task Network Cascades (MNC) for instance-aware semantic segmentation (without stuff categories, however). Their approach decomposes the task into three subtasks: differentiating instances (i.e. detecting instances), estimating masks, and categorizing objects. While the proposed approach demonstrates impressive performance on PASCAL VOC and MS COCO, it is unclear whether this can be fully attributed to this decomposition. It may also be possible to combine object detection with object classification …

Figure 1: Illustration of the proposed approach, see the text for details.

While detailed figures of the network architectures are not provided, Figure 1 shows the general structure of their approach, defining the input and output of the three subtasks. Stage 1 is trained to predict box proposals; the network architecture and the employed loss follow the work on Region Proposal Networks [26]. The second stage, given the box proposals as input, employs a differentiable version of Region of Interest pooling [9] to extract features used for mask prediction. The third stage takes the masks as input, uses Region of Interest pooling to extract features, which are then masked, and predicts object classes. The final loss can then be written as

$L(\theta) = L_1(B(\theta)) + L_2(M(\theta)|B(\theta)) + L_3(C(\theta)|B(\theta),M(\theta))$

where $L_i$ denotes the loss of stage $i$, $B$ are the predicted proposals of stage 1, $M$ the predicted masks of stage 2 and $C$ the class predictions of stage 3. End-to-end training is challenging as the later stages depend on the output of the earlier stages. In particular, it requires differentiable Region of Interest pooling layers, which are used to extract the features for stages 2 and 3. These layers need to be differentiable with respect to the bounding box or mask input. Details can be found in the paper.
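To illustrate why differentiability with respect to the box input matters, the following is a minimal 1D sketch, not the layer from the paper. It assumes that features are sampled inside a continuous interval by linear interpolation (a 1D stand-in for interpolation-based RoI pooling) and compares this to a quantized variant that rounds sample positions to integer indices. All function names and the toy feature vector are hypothetical.

```python
import math


def roi_sample_1d(features, x0, x1, num_bins=4):
    """Sample num_bins values inside the continuous interval [x0, x1]
    using linear interpolation between neighbouring feature entries."""
    out = []
    for k in range(num_bins):
        # Centre of bin k in continuous feature coordinates.
        x = x0 + (k + 0.5) * (x1 - x0) / num_bins
        # Clamp to the valid index range.
        x = min(max(x, 0.0), len(features) - 1.0)
        lo = int(math.floor(x))
        hi = min(lo + 1, len(features) - 1)
        w = x - lo
        out.append((1.0 - w) * features[lo] + w * features[hi])
    return out


def roi_pool_hard_1d(features, x0, x1, num_bins=4):
    """Quantized variant: each sample position is rounded to an integer index."""
    out = []
    for k in range(num_bins):
        x = x0 + (k + 0.5) * (x1 - x0) / num_bins
        idx = min(max(int(round(x)), 0), len(features) - 1)
        out.append(features[idx])
    return out


def finite_difference(fn, features, x0, x1, eps=1e-3):
    """Numerical derivative of the summed output w.r.t. the left box edge x0."""
    plus = sum(fn(features, x0 + eps, x1))
    minus = sum(fn(features, x0 - eps, x1))
    return (plus - minus) / (2 * eps)


# Toy feature vector (assumption, for illustration only).
features = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
soft_grad = finite_difference(roi_sample_1d, features, 1.2, 6.1)
hard_grad = finite_difference(roi_pool_hard_1d, features, 1.2, 6.1)
print("interpolated:", soft_grad, "quantized:", hard_grad)
```

In this toy setting, the interpolated variant yields a non-zero derivative with respect to the box edge, while the quantized variant yields zero: a small shift of the box does not change which entries are selected. Without such a gradient signal, the losses of stages 2 and 3 could not influence the box predictions of stage 1.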

On top of the third stage, they additionally perform class-wise bounding box regression. The regressed bounding boxes are then fed into stages 2 and 3 again. In total, this results in a 5-stage model as depicted in Figure 2.

Figure 2: Final 5-staged model as described in the text.
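The control flow of the 5-stage model can be sketched as follows. This is only a schematic reading of the description above, not the authors' implementation: the four stage callables, their signatures, and the stub implementations are hypothetical placeholders for the actual sub-networks.

```python
def run_five_stage_cascade(features, propose_boxes, predict_masks,
                           classify, regress_boxes):
    """Run the cascade on shared features; all callables are hypothetical."""
    # Stage 1: box proposals.
    boxes = propose_boxes(features)
    # Stage 2: one mask per proposal.
    masks = predict_masks(features, boxes)
    # Stage 3: class predictions from masked features, with class-wise
    # bounding box regression on top.
    scores = classify(features, boxes, masks)
    refined_boxes = regress_boxes(features, boxes, masks, scores)
    # Stage 4: mask prediction again, now on the regressed boxes.
    refined_masks = predict_masks(features, refined_boxes)
    # Stage 5: classification again, on the regressed boxes and new masks.
    refined_scores = classify(features, refined_boxes, refined_masks)
    return refined_boxes, refined_masks, refined_scores


# Stub stages that only record the order in which they are called.
calls = []


def stub_propose_boxes(features):
    calls.append("boxes")
    return [(0, 0, 4, 4)]


def stub_predict_masks(features, boxes):
    calls.append("masks")
    return ["mask@%d,%d,%d,%d" % b for b in boxes]


def stub_classify(features, boxes, masks):
    calls.append("classes")
    return [0.9 for _ in boxes]


def stub_regress_boxes(features, boxes, masks, scores):
    calls.append("regress")
    # Shift every box by one pixel to mimic a refinement.
    return [(x0 + 1, y0 + 1, x1 + 1, y1 + 1) for (x0, y0, x1, y1) in boxes]


result = run_five_stage_cascade(None, stub_propose_boxes, stub_predict_masks,
                                stub_classify, stub_regress_boxes)
print(calls)
```

The same mask and classification callables are reused in stages 4 and 5, mirroring the statement that the regressed boxes are fed into stages 2 and 3 again.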

They provide quantitative and qualitative results on PASCAL VOC and MS COCO, as well as an ablation study showing that end-to-end training of all 5 stages with shared features indeed improves performance over multi-step/multi-network training without shared features. They also outperform related work [13,14,7] on PASCAL VOC. Qualitative results are shown in Figure 3.

Figure 3: Qualitative results of the proposed approach on PASCAL VOC.

  • [26] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [9] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • [13] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
  • [14] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
  • [7] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.