Zhu et al. show that object detection benefits from using object segmentation proposals and segmentation-based features as well as larger context. Their work is largely based on ideas of  and . In particular, they propose three improvements and experimentally show their effectiveness: segmentation-based features, context features computed from an enlarged bounding box, and iterative bounding box refinement.
Table 1 shows experimental results demonstrating the benefit of these three improvements over the baseline model, R-CNN. While the appearance features and the context features (computed on the enlarged bounding box) are extracted with a pre-trained neural network, specifically AlexNet, the segmentation features are mostly hand-crafted. It would be interesting to see whether pre-trained networks can be used for the segmentation-based features as well. Figure 1 additionally shows detection results and the corresponding segmentations selected by their approach.
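To make the notion of an enlarged bounding box concrete, the following is a minimal sketch of how a detection box could be grown around its center and clipped to the image before context features are extracted. The function name enlarge_box, the (x_min, y_min, x_max, y_max) box layout and the scale parameter are assumptions made for illustration; the enlargement factor actually used by Zhu et al. is not given here.

```python
def enlarge_box(box, scale, image_width, image_height):
    """Scale an (x_min, y_min, x_max, y_max) box about its center, then clip it to the image.

    Hypothetical helper: `scale` is an illustrative parameter, not a value from the paper.
    """
    x_min, y_min, x_max, y_max = box
    # Center of the original box.
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Half extents after scaling.
    half_w = (x_max - x_min) * scale / 2.0
    half_h = (y_max - y_min) * scale / 2.0
    # Clip the enlarged box to the image boundaries.
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(image_width), cx + half_w),
        min(float(image_height), cy + half_h),
    )
```

For example, under these assumptions enlarge_box((40, 40, 60, 60), 1.5, 100, 100) grows a 20 x 20 box to 30 x 30 around the same center, while a box touching the image border is simply clipped.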
Table 1: Accuracy for individual classes as well as overall mAP, showing the improvement of the discussed techniques over R-CNN. Here, seg refers to using the segmentation features, exp to the enlarged bounding boxes/context features, ibr to iterative bounding box refinement, and br to regular bounding box refinement (i.e., applied once).
Figure 1: Qualitative results showing detection and recognition as well as the corresponding segmentations. From left to right: ground truth, R-CNN, the proposed approach, segments selected by the proposed approach.
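The difference between regular bounding box refinement (br, applied once) and iterative refinement (ibr) can be sketched as a simple loop. This is only an illustration under stated assumptions: refine_iteratively, toy_regressor, max_iterations and tolerance are hypothetical names and values, and the toy regressor merely stands in for a learned bounding box regressor; it is not the procedure of Zhu et al.

```python
def toy_regressor(box, target=(10.0, 10.0, 50.0, 50.0)):
    """Stand-in for a learned regressor: moves each coordinate halfway toward a fixed target."""
    return tuple(b + 0.5 * (t - b) for b, t in zip(box, target))


def refine_iteratively(box, regress, max_iterations=3, tolerance=0.5):
    """Apply `regress` repeatedly until the box moves less than `tolerance`
    (in its largest coordinate change) or `max_iterations` is reached.

    With max_iterations=1 this reduces to regular, single-step refinement.
    """
    for _ in range(max_iterations):
        new_box = regress(box)
        # Largest absolute change of any coordinate in this step.
        shift = max(abs(a - b) for a, b in zip(new_box, box))
        box = new_box
        if shift < tolerance:
            break
    return box


initial = (0.0, 0.0, 40.0, 40.0)
refined_once = refine_iteratively(initial, toy_regressor, max_iterations=1)
refined_iteratively = refine_iteratively(initial, toy_regressor, max_iterations=3)
```

In this toy setting, each additional pass moves the box closer to the target, which mirrors the intuition behind applying the refinement more than once.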