Bharath Hariharan, Pablo Andrés Arbeláez, Ross B. Girshick, Jitendra Malik. Simultaneous Detection and Segmentation. ECCV, 2014.

Hariharan et al. simultaneously tackle object detection and segmentation. In the literature, this work is often cited as introducing the instance-level semantic segmentation problem. They also introduce the metrics used for this task, namely $AP^r$, $AP^b$ and the corresponding volumetric metrics.


Their approach can be summarized in three steps. First, based on region proposals from [1], they extract features using convolutional neural networks. Then, an SVM is trained to classify the regions based on those features. Finally, the initial region proposals are refined.

For feature extraction, they use two networks, both based on the AlexNet [2] architecture. The first is fine-tuned on bounding boxes and fed both the bounding box content and a version of the bounding box in which the background is masked out by the mean image; the two resulting features are concatenated. In contrast, the second is fine-tuned on masked bounding boxes. Both networks are trained simultaneously, and the features of both are used for the subsequent SVM. The network architectures and training are illustrated in Figure 1.

Based on the region classification, the region proposals are refined. The process is illustrated in Figure 2 and works as follows. First, the bounding box is divided into $10 \times 10$ blocks. For each block, a trained classifier predicts foreground/background using the CNN features. Then, superpixels are extracted and another classifier is trained to combine the coarse segmentation with the generated superpixels. Unfortunately, the description of this process is very short …
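To make the masked network input more concrete, the following is a minimal sketch of how a bounding box crop and its background-masked version could be produced. It is not the authors' code: the function name `masked_crop`, the box layout `(x0, y0, x1, y1)` and the use of a per-channel mean pixel instead of a full mean image are assumptions made for illustration.

```python
import numpy as np

def masked_crop(image, box, region_mask, mean_pixel):
    """Return the box crop and a copy whose background is set to the mean.

    image:       H x W x 3 array
    box:         (x0, y0, x1, y1), assumed inclusive-exclusive pixel coordinates
    region_mask: H x W boolean array, True inside the region proposal
    mean_pixel:  length-3 array; a simplification of the mean image (assumption)
    """
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1].astype(np.float32)
    foreground = region_mask[y0:y1, x0:x1].astype(bool)
    # Keep foreground pixels, replace everything else by the mean.
    mean = np.asarray(mean_pixel, dtype=np.float32)
    masked = np.where(foreground[..., None], crop, mean)
    return crop, masked
```

Both outputs would then be resized to the network's input size and passed through the CNN; concatenating the two resulting feature vectors (for example with `np.concatenate`) corresponds to the concatenation mentioned above.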

Figure 1: Illustration of the two networks for feature extraction which are jointly trained.

Figure 2: Illustration of the region refinement process. From left to right: the original image, the original region proposal taken as input, the coarse $10 \times 10$ mask, the coarse mask projected to superpixels, the output of the final classifier on the superpixels, the refined region.
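The projection step in the refinement can be sketched as follows. Since the paper describes this part only briefly, the details here are assumptions: the coarse grid is upsampled by nearest-neighbour lookup and each superpixel receives the mean coarse score of its pixels. The function name `project_to_superpixels` is hypothetical, and the trained classifier that follows this step is not modelled.

```python
import numpy as np

def project_to_superpixels(coarse, superpixels):
    """Average a coarse grid prediction over each superpixel of a box crop.

    coarse:      G x G array of foreground scores (e.g. 10 x 10)
    superpixels: h x w integer array of superpixel labels, starting at 0
    """
    h, w = superpixels.shape
    gh, gw = coarse.shape
    # Nearest-neighbour upsampling of the grid to the crop resolution.
    rows = np.arange(h) * gh // h
    cols = np.arange(w) * gw // w
    dense = coarse[rows[:, None], cols[None, :]]
    # Mean coarse score per superpixel label.
    labels = superpixels.ravel()
    n = int(labels.max()) + 1
    sums = np.bincount(labels, weights=dense.ravel(), minlength=n)
    counts = np.bincount(labels, minlength=n)
    means = sums / np.maximum(counts, 1)
    # Write each superpixel's mean back to its pixels.
    return means[superpixels]
```

Thresholding the result (for example at 0.5) gives a mask that follows superpixel boundaries instead of the block grid.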


For evaluation, they propose two metrics, $AP^r$ and $AP^b$, which correspond to average precision (i.e. the area under the precision-recall curve) for regions and bounding boxes, respectively. As the threshold for segmentation overlap (or for bounding box overlap in the case of $AP^b$) may vary depending on the application, they propose to use the volume under the $AP^r$ or $AP^b$ surface for different thresholds; these metrics are denoted $AP^r_{vol}$ and $AP^b_{vol}$.
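As an illustration of $AP^r$ and $AP^r_{vol}$, here is a minimal sketch for a single image and a single class. It is not the official evaluation code: the matching rule (each prediction is compared to its best-overlapping ground truth, which can only be matched once), the threshold grid from 0.1 to 0.9 and the averaging over thresholds as an approximation of the volume are assumptions, as are all function names.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def average_precision(scores, pred_masks, gt_masks, threshold):
    """Area under the precision-recall curve at one overlap threshold."""
    order = np.argsort(scores)[::-1]  # most confident prediction first
    matched = set()
    tp = np.zeros(len(order))
    for rank, i in enumerate(order):
        ious = [mask_iou(pred_masks[i], gt) for gt in gt_masks]
        best = int(np.argmax(ious)) if ious else -1
        # True positive only if the best ground truth is still unmatched.
        if best >= 0 and ious[best] > threshold and best not in matched:
            tp[rank] = 1.0
            matched.add(best)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(order)) + 1)
    recall = cum_tp / max(len(gt_masks), 1)
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum(precision * (recall - prev_recall)))

def average_precision_vol(scores, pred_masks, gt_masks,
                          thresholds=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Mean AP over several thresholds, approximating the volume."""
    return float(np.mean([average_precision(scores, pred_masks, gt_masks, t)
                          for t in thresholds]))
```

Replacing `mask_iou` by a bounding box overlap would give the analogous sketch for $AP^b$ and $AP^b_{vol}$.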

Experimental results can be found in the paper. Qualitative results are shown in Figure 3.

Figure 3: Qualitative results of the proposed method.

  • [1] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., Malik, J.: Multiscale combinatorial grouping. In: CVPR (2014).
  • [2] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012).