5th January 2018

READING

Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov, Sanja Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. CVPR, 2015.

Zhu et al. show that object detection benefits from object segmentation proposals, segmentation-based features, and larger context. Their work builds largely on ideas from [8] and [9]. In particular, they propose three improvements and demonstrate their effectiveness experimentally:

  1. Using a Markov random field, their model reasons jointly over segmentations (obtained from CPMC [3]) and candidate bounding boxes. The segmentations are also used to compute features.
  2. Using a larger context for bounding box candidates by enlarging the original bounding box by a fixed percentage.
  3. Iteratively refining the bounding boxes by repeatedly performing bounding box prediction on the final candidate set. After each prediction step, the features are re-computed if individual bounding boxes changed significantly.
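Improvements 2 and 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-side interpretation of the enlargement ratio, the IoU threshold used to decide whether a box "changed significantly", the iteration limit, and the early stop on a stable box are all assumptions made for illustration. The `extract` and `regress` callables are hypothetical stand-ins for the feature extractor and the bounding box regressor.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def enlarge_box(box: Box, ratio: float, width: float, height: float) -> Box:
    """Grow a box by `ratio` of its size on every side, clipped to the image.

    Assumption: the "fixed percentage" is applied per side.
    """
    x0, y0, x1, y1 = box
    dx = ratio * (x1 - x0)
    dy = ratio * (y1 - y0)
    return (max(0.0, x0 - dx), max(0.0, y0 - dy),
            min(width, x1 + dx), min(height, y1 + dy))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def refine_iteratively(box: Box,
                       extract: Callable[[Box], object],
                       regress: Callable[[object, Box], Box],
                       max_iters: int = 3,
                       min_iou: float = 0.9) -> Box:
    """Repeat box regression, re-extracting features only after large changes.

    Assumptions: a change is "significant" if the IoU between the old and
    the new box drops below `min_iou`, and refinement stops early once the
    box is stable.
    """
    features = extract(box)
    for _ in range(max_iters):
        new_box = regress(features, box)
        changed = iou(box, new_box) < min_iou
        box = new_box
        if not changed:
            break
        features = extract(box)  # box moved a lot: features are stale
    return box
```

In this sketch, regular bounding box refinement (performed once) corresponds to `max_iters=1`.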

Table 1 shows experimental results demonstrating the benefit of these three improvements over the baseline model, R-CNN [9]. While the appearance features and context features (using the enlarged bounding box) are computed using pre-trained neural networks, specifically AlexNet [15], the segmentation features are mostly hand-crafted. It would be interesting to see whether pre-trained networks can be used for the segmentation-based features as well. Figure 1 additionally shows detection results and the corresponding segmentations selected by their approach.

Table 1: Accuracy for individual classes as well as overall mAP, showing the improvement of the discussed techniques over R-CNN. Here, seg refers to using the segmentation features, exp to the enlarged bounding boxes/context features, ibr to iterative bounding box refinement, and br to regular bounding box refinement (i.e., performed once).

Figure 1: Qualitative results showing detection and recognition as well as the corresponding segmentations. From left to right: ground truth, R-CNN, the proposed approach, segments selected by the proposed approach.

  • [3] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. CVPR, 2010.
  • [8] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun. Bottom-up segmentation for top-down detection. CVPR, 2013.
  • [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524, 2013.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
