Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov, Sanja Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. CVPR, 2015.

Zhu et al. show that object detection benefits from object segmentation proposals and segmentation-based features as well as from larger context. Their work builds largely on the ideas of [8] and [9]. In particular, they propose three important improvements and experimentally show their effectiveness:

  1. Using a Markov random field, their model reasons jointly over segmentations (obtained from CPMC [3]) and candidate bounding boxes. The segmentations are also used to compute features.
  2. Use a larger context for bounding box candidates by enlarging the original bounding box by a fixed percentage.
  3. Iteratively refine the bounding boxes by repeatedly performing bounding box prediction on the final candidate set. After each prediction step, the features are re-computed if individual bounding boxes changed significantly.
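
To make the second and third point more concrete, the following is a minimal Python sketch and not the authors' implementation. The function names, the enlargement `ratio`, the IoU threshold `min_iou` used to decide whether a box "changed significantly", and the fixed number of iterations are all assumptions chosen for illustration; `extract_features` and `regress` are hypothetical callables standing in for the network's feature extraction and the bounding box regressor.

```python
from typing import Any, Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def enlarge_box(box: Box, ratio: float, img_w: float, img_h: float) -> Box:
    """Grow a box by a fixed fraction of its width/height and clip it to the image.

    `ratio` is a hypothetical parameter: 0.2 enlarges width and height by 20%.
    """
    x1, y1, x2, y2 = box
    dw = (x2 - x1) * ratio / 2.0
    dh = (y2 - y1) * ratio / 2.0
    return (max(0.0, x1 - dw), max(0.0, y1 - dh),
            min(float(img_w), x2 + dw), min(float(img_h), y2 + dh))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def refine_iteratively(box: Box,
                       extract_features: Callable[[Box], Any],
                       regress: Callable[[Box, Any], Box],
                       num_iters: int = 3,
                       min_iou: float = 0.9) -> Box:
    """Repeat box regression; re-compute features only after a significant change.

    Assumption: "changed significantly" is modeled as the IoU between the new
    box and the box the current features were computed on dropping below
    `min_iou`.
    """
    feat_box = box
    feats = extract_features(feat_box)
    for _ in range(num_iters):
        box = regress(box, feats)
        if iou(feat_box, box) < min_iou:
            # The box moved far enough that the cached features are stale.
            feat_box = box
            feats = extract_features(feat_box)
    return box
```

In this sketch, the context features would be computed on `enlarge_box(box, ...)` while the appearance features use the original box. Re-computing features only after a significant change avoids a full forward pass in iterations where the regressor barely moves the box.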

Table 1 shows experimental results demonstrating the effectiveness of these three improvements over the baseline model R-CNN. While the appearance features and context features (using the enlarged bounding box) are computed using pre-trained neural networks, specifically AlexNet [15], the segmentation features are mostly hand-crafted. It would be interesting to see whether pre-trained networks can be used for the segmentation-based features as well. Figure 1 additionally shows detection results and the corresponding segmentations selected by their approach.

Table 1: Accuracy for individual classes as well as overall mAP showing the improvement of the discussed techniques over R-CNN. Here, seg refers to using the segmentation features, exp to the enlarged bounding boxes/context features, ibr to iterative bounding box refinement and br to regular bounding box refinement (i.e. once).

Figure 1: Qualitative results showing detection and recognition as well as the corresponding segmentations. From left to right: ground truth, R-CNN, the proposed approach, segments selected by the proposed approach.

  • [3] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. CVPR, 2013.
  • [8] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun. Bottom-up segmentation for top-down detection. CVPR, 2013.
  • [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524, 2013.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.