W. Kuo, B. Hariharan, J. Malik. DeepBox: Learning Objectness with Convolutional Networks. ICCV, 2015.

Kuo et al. propose DeepBox, a convolutional-network-based object proposal detector that is similar to, and partly based on, Edge Boxes []. The network starts from AlexNet [], whose complexity is then gradually reduced. Concretely, they report improved performance in terms of AUC (the area under the curve obtained by plotting recall against the number of proposals) when dropping the 3rd, 4th and 5th convolutional layers, reducing the input size to $120 \times 120$ and dropping some units in the fully connected layers. The final architecture consists of two convolutional layers (the first followed by pooling) and one fully connected layer before the final classification layer.
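To make the AUC metric concrete, the following is a minimal sketch of how the area under a recall-versus-number-of-proposals curve could be computed. The function name `recall_auc`, the trapezoidal rule and the normalization by the range of proposal counts (on a linear axis) are assumptions for illustration; the paper may normalize or scale the axis differently.

```python
def recall_auc(num_proposals, recalls):
    """Normalized area under a recall-vs-number-of-proposals curve.

    Assumption: trapezoidal rule on a linear proposal axis, divided by the
    covered range so that a constant recall r yields an AUC of r.
    """
    if len(num_proposals) != len(recalls) or len(num_proposals) < 2:
        raise ValueError("need at least two matching (count, recall) points")
    area = 0.0
    for i in range(1, len(num_proposals)):
        width = num_proposals[i] - num_proposals[i - 1]
        if width <= 0:
            raise ValueError("proposal counts must be strictly increasing")
        # Area of the trapezoid between two neighbouring curve points.
        area += 0.5 * (recalls[i] + recalls[i - 1]) * width
    return area / (num_proposals[-1] - num_proposals[0])
```

With this normalization, two methods evaluated over the same proposal counts can be compared by a single number between 0 and 1.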

For efficient proposal detection, they further use the idea of the Spatial Pyramid Pooling (SPP) layer [] and run the network at multiple scales. The convolutional layers are computed once on the whole image (for each scale); the regions corresponding to bounding boxes are then pooled from the resulting feature maps and fed to the fully connected layers.
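The idea of sharing the convolutional computation and pooling per box can be sketched as follows. This is a simplified, single-level region max pooling on one feature channel, not the exact layer used in the paper; `roi_max_pool`, `stride` and `out_size` are hypothetical names chosen for this illustration.

```python
import math


def roi_max_pool(feature_map, box, stride, out_size):
    """Max-pool the feature-map region under `box` into an out_size x out_size grid.

    feature_map: 2D list (one channel), computed once for the whole image.
    box: (x1, y1, x2, y2) in image pixels.
    stride: image pixels per feature-map cell (assumed uniform).
    """
    height, width = len(feature_map), len(feature_map[0])
    x1, y1, x2, y2 = box
    # Project the box onto the feature map and clamp it to a non-empty region.
    fx1 = min(max(math.floor(x1 / stride), 0), width - 1)
    fy1 = min(max(math.floor(y1 / stride), 0), height - 1)
    fx2 = min(max(math.ceil(x2 / stride), fx1 + 1), width)
    fy2 = min(max(math.ceil(y2 / stride), fy1 + 1), height)
    roi_w, roi_h = fx2 - fx1, fy2 - fy1

    pooled = []
    for i in range(out_size):
        # Bin boundaries: floor for the start, ceil for the end (never empty).
        r0 = fy1 + (i * roi_h) // out_size
        r1 = fy1 + -(-((i + 1) * roi_h) // out_size)
        row = []
        for j in range(out_size):
            c0 = fx1 + (j * roi_w) // out_size
            c1 = fx1 + -(-((j + 1) * roi_w) // out_size)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Because every box is reduced to the same fixed-size grid, the fully connected layers can be applied to arbitrarily shaped proposals while the expensive convolutions run only once per image and scale.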

Training proceeds in two steps. Note that the first and second convolutional layers are initialized from an AlexNet pre-trained on ImageNet. In the first step, negatives are sampled in a sliding-window fashion and positives are obtained by perturbing the ground truth bounding boxes, i.e. by randomly translating their corner points. In the second step, hard negatives are acquired using Edge Boxes [].
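The first training step can be illustrated with a short sketch. The shift range `max_shift`, the window size, the step and the overlap threshold `max_iou` are hypothetical parameters chosen for illustration and are not taken from the paper.

```python
import random


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def perturb_box(box, max_shift, rng):
    """Positive sample: translate each corner coordinate of a ground truth box
    by a random offset of at most max_shift times the box width/height.
    Assumes max_shift < 0.5 so the box stays valid."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (x1 + rng.uniform(-max_shift, max_shift) * w,
            y1 + rng.uniform(-max_shift, max_shift) * h,
            x2 + rng.uniform(-max_shift, max_shift) * w,
            y2 + rng.uniform(-max_shift, max_shift) * h)


def sliding_window_negatives(image_w, image_h, win, step, gt_boxes, max_iou):
    """Negative samples: square sliding windows that overlap every ground
    truth box by at most max_iou."""
    negatives = []
    for y in range(0, image_h - win + 1, step):
        for x in range(0, image_w - win + 1, step):
            box = (x, y, x + win, y + win)
            if all(iou(box, gt) <= max_iou for gt in gt_boxes):
                negatives.append(box)
    return negatives
```

For example, `perturb_box((10, 10, 30, 50), 0.1, random.Random(0))` yields a box whose corners lie within 10% of the box size of the original ones. In the second step, the sliding-window negatives would be replaced by low-overlap Edge Boxes proposals, which are harder to distinguish from true objects.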

In experiments, the proposed method is shown to be superior to Edge Boxes on PASCAL VOC 2007 [] and to generalize well to unseen classes on Microsoft COCO []. Figure 1 shows recall curves for DeepBox and Edge Boxes on PASCAL VOC 2007 and Microsoft COCO at an Intersection over Union (IoU) threshold of 0.7.

Figure 1: Quantitative results on PASCAL VOC 2007 (left) and Microsoft COCO (right) for IoU = 0.7.
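Each point on such a recall curve can be computed as sketched below: a ground truth box counts as recalled if at least one of the top-k ranked proposals overlaps it with IoU of at least 0.7. The function names and the `(x1, y1, x2, y2)` box format are assumptions for illustration, and the sketch evaluates a single image rather than a whole dataset.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(ranked_proposals, gt_boxes, k, iou_threshold=0.7):
    """Fraction of ground truth boxes covered by at least one of the
    k highest-ranked proposals at the given IoU threshold."""
    if not gt_boxes:
        return 0.0
    top = ranked_proposals[:k]
    hits = sum(1 for gt in gt_boxes
               if any(box_iou(p, gt) >= iou_threshold for p in top))
    return hits / len(gt_boxes)
```

Evaluating `recall_at_k` for increasing k gives the curve; a better ranking of the same proposals moves hits towards smaller k and thus raises the curve.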

  • [] C. L. Zitnick, P. Dollár. Edge Boxes: Locating object proposals from edges. ECCV, 2014.
  • [] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
  • [] K. He, X. Zhang, S. Ren, J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. ECCV, 2014.
  • [] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. IJCV, 2010.
  • [] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.