Shaoqing Ren, Kaiming He, Ross B. Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015: 91-99

Ren et al. improve upon Fast R-CNN [2] by introducing a Region Proposal Network (RPN) instead of Selective Search [4] or similar proposal detectors. The advantage is that computation can be shared – in particular, the lower-level convolutional layers of deep networks. Besides the proposed RPN, they also discuss training and present a comprehensive evaluation in comparison to other object proposal detectors.

The RPN operates on a set of convolutional feature maps (e.g. obtained from a pre-trained deep network such as VGG16 [3]). At each location it considers a fixed window, in their case of size $3 \times 3$. This is naturally realized through a $3 \times 3$ convolutional layer with a specific number of channels. As a result, each window is represented by a fixed-size feature vector. Two sibling fully connected layers then predict objectness (in the form of a 2-unit softmax per anchor) and bounding boxes. For bounding box regression, $9$ different so-called anchor boxes are considered. That is, the bounding box regression layer has $9 \cdot 4$ outputs – one bounding box for each of the $9$ anchor boxes. Note that this is in contrast to other methods: the network is not trained on different bounding box sizes or scales; scale and aspect ratio are instead handled implicitly through the regression outputs relative to the anchors. The overall idea is illustrated in Figure 1.
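As a minimal sketch, the $9$ anchors arise from $3$ scales combined with $3$ aspect ratios. The following numpy snippet generates them centered at the origin; the concrete values follow the paper's VGG16 setting (stride-$16$ feature maps, anchor areas of $128^2$, $256^2$ and $512^2$ pixels), while the function name and signature are ours:

```python
import numpy as np

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate the 9 anchor boxes (3 scales x 3 aspect ratios) centered at
    the origin, as (x1, y1, x2, y2) corner coordinates."""
    anchors = []
    for scale in scales:
        # Anchor area in pixels, e.g. (16 * 8)^2 = 128^2 for the smallest scale.
        area = (base_size * scale) ** 2
        for ratio in ratios:
            # Choose width/height such that w * h = area and h / w = ratio.
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = generate_anchors()  # shape (9, 4)
```

At test time, these $9$ anchors are translated to every feature map location, so a feature map of size $H \times W$ yields $9 \cdot H \cdot W$ candidate boxes before score thresholding and non-maximum suppression.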

Figure 1: Illustration of the Region Proposal Network (RPN), where a small network is evaluated on $3 \times 3$ patches of the input feature maps.

The loss used for the RPN combines the smooth $L_1$ loss for bounding box regression with a softmax loss for objectness prediction. They also discuss the parameterization used for bounding box regression relative to the anchor boxes. Details can be found in the paper.
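For concreteness, a small sketch of both ingredients: the smooth $L_1$ loss as defined in the paper, and the box parameterization $t_x = (x - x_a)/w_a$, $t_y = (y - y_a)/h_a$, $t_w = \log(w/w_a)$, $t_h = \log(h/h_a)$ relative to an anchor (function names are ours):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: 0.5 x^2 for |x| < 1, and |x| - 0.5 otherwise.
    Quadratic near zero, linear for large errors (less outlier-sensitive)."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def encode_box(gt, anchor):
    """Parameterize a ground truth box relative to an anchor, both given as
    center/size tuples (cx, cy, w, h). The regression layer is trained to
    predict these normalized targets."""
    gx, gy, gw, gh = gt
    ax, ay, aw, ah = anchor
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])
```

Because the targets are normalized by the anchor size, one regression layer can handle all scales and aspect ratios – which is exactly why no explicit multi-scale training is needed.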

An individual RPN can be trained efficiently on “image-centric” mini-batches, i.e. all bounding boxes/proposals in a mini-batch come from a single image in order to share computation. Ren et al. randomly sample $256$ anchors per image, of which at most $128$ are positives. Negatives are defined by an overlap of less than $0.3$ (Intersection-over-Union) with every ground truth box, and positives by an overlap of more than $0.7$.
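The labeling step can be sketched as follows – a simplified version assuming axis-aligned corner boxes, which omits the paper's additional rule that the highest-overlapping anchor per ground truth box is also marked positive:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors as positive (1), negative (0) or ignored (-1)
    based on their best IoU with any ground truth box."""
    labels = np.full(len(anchors), -1)
    for i, anchor in enumerate(anchors):
        best = max(iou(anchor, gt) for gt in gt_boxes)
        if best > pos_thresh:
            labels[i] = 1
        elif best < neg_thresh:
            labels[i] = 0
    return labels
```

From the labeled anchors, $256$ are then sampled per image (e.g. via `np.random.choice`), with at most $128$ positives; anchors labeled $-1$ do not contribute to the loss.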

The RPN and the detection network, i.e. the R-CNN, are trained jointly in the following 4-step procedure: First, the RPN is trained (initialized from ImageNet). Then, the R-CNN is trained on the proposals from the RPN (note that, so far, the two networks are separate). Third, the R-CNN is used to initialize the RPN, where the shared convolutional layers are then fixed and only the layers unique to the RPN are fine-tuned. Finally, keeping the shared convolutional layers fixed, the R-CNN is fine-tuned. It would be interesting to see whether there are “real” end-to-end solutions for training Faster R-CNNs.
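The schedule above can be summarized as a sketch that only tracks which parameter groups are trainable at each step and where they are initialized from; the group names (`shared_conv`, `rpn_head`, `detector_head`) are hypothetical labels, and the actual optimization is omitted:

```python
def four_step_schedule():
    """Return the 4-step alternating training schedule as a list of steps,
    each recording the trainable parameter groups and the initialization."""
    steps = []
    # Step 1: train the RPN, initialized from an ImageNet-pretrained model.
    steps.append({"trainable": {"shared_conv", "rpn_head"}, "init": "imagenet"})
    # Step 2: train the detector on RPN proposals (still separate conv layers).
    steps.append({"trainable": {"shared_conv", "detector_head"}, "init": "imagenet"})
    # Step 3: re-initialize the RPN from the detector; shared conv layers fixed,
    # only the RPN-specific layers are fine-tuned.
    steps.append({"trainable": {"rpn_head"}, "init": "detector"})
    # Step 4: fine-tune the detector head, shared conv layers still fixed.
    steps.append({"trainable": {"detector_head"}, "init": "detector"})
    return steps
```

After steps 3 and 4 the two networks share the same convolutional layers, which is what makes the proposals nearly free at test time.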

  • [2] R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015.
  • [3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
  • [4] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision (IJCV), 2013.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.