Zhu et al. show that object detection benefits from object segmentation proposals and segmentation-based features, as well as from larger context. Their work is largely based on ideas of  and . In particular, they propose three important improvements and experimentally show their effectiveness: segmentation-based features, context features computed from enlarged bounding boxes, and iterative bounding box refinement.
Table 1 shows experimental results demonstrating the benefit of these three improvements over the baseline model, R-CNN. While the appearance features and the context features (computed on the enlarged bounding box) are extracted with pre-trained neural networks, specifically AlexNet, the segmentation features are mostly hand-crafted. It would be interesting to see whether pre-trained networks could be used for the segmentation-based features as well. Figure 1 additionally shows detection results and the corresponding segmentations selected by their approach.
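To make the enlarged bounding boxes and the difference between one-off and iterative refinement more concrete, here is a minimal sketch. It is not the authors' implementation: the helper names `enlarge_box` and `refine_iteratively`, the `scale` parameter and the `regress` callable (standing in for a learned bounding box regressor) are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]


def enlarge_box(box: Box, scale: float, img_w: float, img_h: float) -> Box:
    """Scale a box about its center and clip it to the image bounds."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * scale / 2.0
    half_h = (y1 - y0) * scale / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(img_w, cx + half_w),
        min(img_h, cy + half_h),
    )


def refine_iteratively(box: Box, regress: Callable[[Box], Box], steps: int) -> Box:
    """Apply a (hypothetical) box regressor `steps` times.

    steps=1 corresponds to regular refinement, steps>1 to iterative refinement.
    """
    for _ in range(steps):
        box = regress(box)
    return box
```

In this sketch, context features would be computed on the crop given by `enlarge_box`, while `refine_iteratively` simply feeds each refined box back into the regressor; the actual scale factor and number of iterations are not specified here.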
Table 1: Accuracy for individual classes as well as overall mAP, showing the improvement of the discussed techniques over R-CNN. Here, seg refers to using the segmentation features, exp to the enlarged bounding boxes/context features, ibr to iterative bounding box refinement, and br to regular bounding box refinement (i.e., applied once).
Figure 1: Qualitative results showing detection and recognition as well as the corresponding segmentations. From left to right: ground truth, R-CNN, the proposed approach, segments selected by the proposed approach.
What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: