Fast R-CNN is Girshick’s improved R-CNN model addressing the problem of slow, multi-stage training. The high-level approach is summarized in Figure 1. These problems are addressed through several contributions. First, a Region-of-Interest (RoI) pooling layer is introduced to share computation during training and evaluation. Second, a multi-task loss is used to train features, classifier and bounding box regressor simultaneously. Finally, the weights of the fully connected layers are replaced by truncated SVD approximations to reduce computational overhead.
Figure 1: High level overview of the proposed system including the Region-of-Interest (RoI) pooling layer.
Region-of-Interest pooling is similar to a Spatial Pyramid Pooling layer with only one pyramid level. In particular, given a feature map and a region of interest (in the form of a bounding box), the RoI pooling layer converts the features in this region into a fixed $H \times W$ feature map. To this end, the region of interest is subdivided into a fixed number of sub-regions which are then max-pooled; this is applied to each channel separately. The backward pass of this layer is similar to the backward pass of a max-pooling layer. During training, this allows working on mini-batches while sharing computation. For example, taking two images per mini-batch, each providing $64$ bounding boxes, the computationally expensive part of the network is evaluated only twice.
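The forward pass of RoI pooling can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's implementation: the grid is formed with a naive integer rounding scheme, and the RoI coordinates are assumed to already be given in feature-map coordinates.

```python
import numpy as np

def roi_pool(feature_map, roi, H=7, W=7):
    """Max-pool the RoI region of a (C, h, w) feature map to (C, H, W).

    roi = (x1, y1, x2, y2) in feature-map coordinates (illustrative
    interface; the rounding scheme is a simplification).
    """
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2, x1:x2]
    C, rh, rw = region.shape
    # Subdivide the RoI into an H x W grid of roughly equal cells.
    ys = np.linspace(0, rh, H + 1).astype(int)
    xs = np.linspace(0, rw, W + 1).astype(int)
    out = np.zeros((C, H, W), dtype=feature_map.dtype)
    for i in range(H):
        for j in range(W):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            if cell.size:
                # Max over each cell, per channel separately.
                out[:, i, j] = cell.max(axis=(1, 2))
    return out
```

Regardless of the size of the input bounding box, the output always has shape $(C, H, W)$, which is what allows arbitrary region proposals to be fed into the fixed-size fully connected layers.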
The multi-task loss combines classification and bounding box regression by adding both losses. For bounding box regression, a smooth $L_1$ loss is used; see the paper for details. The bounding box regression loss is only enabled if the input sample is not background.
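The smooth $L_1$ loss from the paper is quadratic near zero and linear for large residuals, making it less sensitive to outliers than an $L_2$ loss. Element-wise it reads:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss as defined in the Fast R-CNN paper, element-wise:
    0.5 * x^2 if |x| < 1, and |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)
```

Here $x$ would be the difference between a predicted and a target bounding box offset; the two branches agree at $|x| = 1$, so the loss is continuous and has a bounded gradient.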
For fast evaluation of a fully connected layer, the corresponding weight matrix $W$ is approximated by
$W \approx U \Sigma_t V^T$
where $\Sigma_t$ contains the $t$ dominant singular values and $U$ and $V$ are truncated accordingly. In experiments, Girshick shows that this speeds up computation by up to $30\%$ while reducing performance by only $0.3$ percentage points (in mean Average Precision).
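In practice this amounts to replacing one fully connected layer with weights $W$ (of size $u \times v$) by two smaller ones, so the cost drops from $uv$ to $t(u + v)$ multiplications. A minimal sketch with numpy (the function name and factor split are illustrative):

```python
import numpy as np

def truncate_fc(W, t):
    """Approximate a fully connected layer's weight matrix W (u x v)
    by two stacked layers: W ~= (U Sigma_t) @ (V^T), keeping the t
    dominant singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :t] * s[:t]  # first layer: U Sigma_t, shape (u, t)
    B = Vt[:t, :]         # second layer: truncated V^T, shape (t, v)
    return A, B
```

The product `A @ B` is the best rank-$t$ approximation of $W$ in the Frobenius norm, so small $t$ trades a controlled amount of accuracy for a large reduction in compute.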
The experimental section of the paper is also worth reading. In particular, Girshick shows that the network does not need to be applied in a multi-scale fashion to reach good performance, and that multi-task learning is beneficial compared to using separate SVM classifiers and bounding box regressors. He also experiments with different numbers of input proposals and finds that more proposals are not necessarily beneficial.
What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or get in touch with me: