Kuo et al. propose DeepBox, a convolutional-network-based object proposal detector similar to (and partly based on) Edge Boxes. The architecture starts from AlexNet and is gradually reduced in complexity. Concretely, they report improved performance in terms of AUC (Area Under the Curve obtained by plotting recall against the number of proposals) when dropping the 3rd, 4th and 5th convolutional layers, reducing the input size to $120 \times 120$ and dropping some units in the fully connected layers. The final architecture consists of two convolutional layers (the first followed by pooling) and one fully connected layer before the final classification layer.
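To make the AUC metric concrete, here is a minimal sketch of the area under a recall-versus-number-of-proposals curve. The function name `recall_auc`, the trapezoidal rule, the linear x-axis and the normalization by the proposal range are my assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
def recall_auc(num_proposals, recalls):
    """Normalized area under the recall-vs-#proposals curve (trapezoidal rule).

    num_proposals: increasing proposal budgets, e.g. [10, 100, 1000]
    recalls:       recall measured at each budget, values in [0, 1]
    """
    if len(num_proposals) != len(recalls) or len(num_proposals) < 2:
        raise ValueError("need at least two matching (budget, recall) points")

    area = 0.0
    for i in range(1, len(num_proposals)):
        width = num_proposals[i] - num_proposals[i - 1]
        # Average height of the two neighbouring points times the segment width.
        area += width * (recalls[i] + recalls[i - 1]) / 2.0

    # Divide by the covered range so the result lies in [0, 1].
    return area / (num_proposals[-1] - num_proposals[0])
```

With this normalization, a detector that reaches high recall with few proposals scores close to 1, which is what makes AUC a convenient single number for comparing architectures.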
For efficient proposal detection, they further utilize the idea of the Spatial Pyramid Pooling (SPP) layer and run the network on multiple scales. The convolutional layers are computed once on the whole image (for each scale), and the feature-map regions corresponding to the bounding boxes are then pooled and fed to the fully connected layers.
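The key property of this pooling step is that boxes of arbitrary size are mapped to a fixed-length feature vector, so the convolutional features can be shared across all boxes. The following is a rough sketch of that idea on a single-channel feature map stored as nested lists; `pool_region`, `spp_features` and the pyramid levels `(1, 2)` are hypothetical choices of mine, not the configuration used in the paper.

```python
import math


def pool_region(feature_map, box, grid):
    """Max-pool the region `box` of a 2-D feature map into grid x grid values.

    box = (x0, y0, x1, y1) in feature-map coordinates, x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    width, height = x1 - x0, y1 - y0
    pooled = []
    for gy in range(grid):
        # Floor for the bin start and ceil for the bin end, so that the
        # bins cover the whole region and are never empty.
        ys = y0 + math.floor(gy * height / grid)
        ye = y0 + math.ceil((gy + 1) * height / grid)
        for gx in range(grid):
            xs = x0 + math.floor(gx * width / grid)
            xe = x0 + math.ceil((gx + 1) * width / grid)
            pooled.append(max(feature_map[y][x]
                              for y in range(ys, ye)
                              for x in range(xs, xe)))
    return pooled


def spp_features(feature_map, box, levels=(1, 2)):
    """Concatenate pooled outputs of several grid sizes into one vector."""
    features = []
    for grid in levels:
        features.extend(pool_region(feature_map, box, grid))
    return features
```

Regardless of the box size, `spp_features` returns `sum(g * g for g in levels)` values, which is what allows a fully connected layer with a fixed input size to follow.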
Training proceeds in two steps. Note that the first and second convolutional layers are initialized from an AlexNet pre-trained on ImageNet. In the first step, negatives are sampled in a sliding-window fashion, and positives are generated by randomly translating the corner points of the ground truth bounding boxes. In the second step, hard negatives are acquired using Edge Boxes.
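The sample generation of the first training step can be sketched as follows. This is only an illustration under my own assumptions: the helper names `perturb_box` and `sliding_windows`, the uniform shift distribution, the square window and the stride are hypothetical, and the filtering of windows by their overlap with the ground truth is omitted.

```python
import random


def perturb_box(box, max_shift, rng):
    """Create a positive sample by randomly translating both corner points.

    box = (x0, y0, x1, y1); each coordinate moves by at most `max_shift`.
    """
    x0, y0, x1, y1 = box
    nx0 = x0 + rng.uniform(-max_shift, max_shift)
    ny0 = y0 + rng.uniform(-max_shift, max_shift)
    nx1 = x1 + rng.uniform(-max_shift, max_shift)
    ny1 = y1 + rng.uniform(-max_shift, max_shift)
    # Re-order the coordinates in case a large shift flipped the corners.
    return (min(nx0, nx1), min(ny0, ny1), max(nx0, nx1), max(ny0, ny1))


def sliding_windows(img_w, img_h, win, stride):
    """Enumerate square candidate windows over the image (negative candidates)."""
    return [(x, y, x + win, y + win)
            for y in range(0, img_h - win + 1, stride)
            for x in range(0, img_w - win + 1, stride)]


rng = random.Random(0)  # fixed seed for reproducible sampling
ground_truth = (40, 30, 120, 110)
positives = [perturb_box(ground_truth, max_shift=8, rng=rng) for _ in range(4)]
negative_candidates = sliding_windows(img_w=200, img_h=200, win=50, stride=25)
```

In practice, the candidate windows would still need to be checked against the ground truth so that only windows with low overlap are kept as negatives.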
In experiments, the proposed method is shown to outperform Edge Boxes on PASCAL VOC 2007 and to generalize well to unseen classes on Microsoft COCO. Figure 1 shows recall curves for DeepBox and Edge Boxes on PASCAL VOC 2007 and Microsoft COCO at an Intersection over Union (IoU) threshold of 0.7.
Figure 1: Quantitative results on PASCAL VOC 2007 (left) and Microsoft COCO (right) for IoU = 0.7.
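For reference, recall at a fixed IoU threshold can be computed along the following lines. This is a minimal sketch assuming axis-aligned boxes given as `(x0, y0, x1, y1)` and proposals already sorted by score; the function names are my own, and the official evaluation code of the benchmarks may handle details differently.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(gt_boxes, ranked_proposals, k, iou_threshold=0.7):
    """Fraction of ground truth boxes covered by one of the top-k proposals."""
    if not gt_boxes:
        return 0.0
    top = ranked_proposals[:k]
    covered = sum(1 for gt in gt_boxes
                  if any(iou(gt, p) >= iou_threshold for p in top))
    return covered / len(gt_boxes)
```

Evaluating `recall_at_k` for increasing `k` yields one point per proposal budget, which is how a recall curve like the one in Figure 1 can be traced.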