Girshick et al. propose R-CNN, an object detector which, together with several improvements [1, 2], long defined the state of the art in object detection and related tasks. The original paper not only describes the first “version” of R-CNN but also provides a thorough experimental study of hyper-parameters and design choices. At its core, the object detection pipeline consists of several modules: an object proposal module (such as selective search), a feature extractor and a classifier. The pipeline additionally includes post-processing steps such as non-maximum suppression and bounding box regression.
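To make the non-maximum suppression step concrete, here is a minimal NumPy sketch of the usual greedy variant: keep the highest-scoring box, discard all boxes that overlap it too much, and repeat. The function names, the `[x1, y1, x2, y2]` box format and the threshold of 0.3 are my own assumptions for illustration, not values taken from the paper, and boxes are assumed to have positive area.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """Intersection over union between one box and an (N, 4) array of boxes.

    Boxes are [x1, y1, x2, y2] (assumed format) with positive area.
    """
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Clip at zero so that non-overlapping boxes get an empty intersection.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression; returns indices of the kept boxes.

    iou_threshold is an illustrative default, not a value from the paper.
    """
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        # Only boxes that overlap the selected box weakly stay in the race.
        order = rest[overlaps <= iou_threshold]
    return keep
```

In a detector like R-CNN this would be run separately for each class, on the boxes scored by that class's classifier.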
Based on proposals from selective search, features are extracted using a pre-trained convolutional neural network such as AlexNet (referred to as T-Net) or VGG16 (referred to as O-Net). The 4096-dimensional feature vector of the last fully connected layer is used as features. All proposed bounding boxes are warped to the input size of the network (details can be found in the appendix). The prediction layer of the network is removed and the (pre-trained) network is fine-tuned for object detection. At test time, the computed features are then fed into class-specific SVMs. For all modules, Girshick et al. detail the hyper-parameters and training schemes used.
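The two ends of this stage, warping a proposal to the network input and scoring the resulting feature vector with per-class linear SVMs, can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: `warp_proposal` and `svm_scores` are hypothetical names, the default size of 227 is the input size I assume for the AlexNet variant, nearest-neighbour sampling stands in for proper interpolation, and the context padding discussed in the paper's appendix is left out. The CNN itself is not shown.

```python
import numpy as np


def warp_proposal(image, box, size=227):
    """Crop a proposal and warp it anisotropically to size x size pixels.

    image: (H, W, C) array; box: integer (x1, y1, x2, y2) inside the image.
    Uses nearest-neighbour sampling, so the aspect ratio is not preserved.
    """
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # For every output pixel, pick the source row/column it falls into.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return crop[rows[:, None], cols[None, :]]


def svm_scores(features, weights, biases):
    """Score N feature vectors with C independent linear SVMs.

    features: (N, D), weights: (C, D), biases: (C,); returns (N, C) scores.
    """
    return features @ weights.T + biases
```

With 4096-dimensional features, `weights` would hold one 4096-dimensional row per object class, and each column of the score matrix is then passed through non-maximum suppression on its own.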
In addition to non-maximum suppression, bounding box regression is employed to improve localization. Girshick et al. detail the approach in the appendix. In short, linear regressors are trained to shift (and scale) a bounding box proposal for better localization.
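As a sketch of what "shift and scale" means here: the regressors predict four numbers per proposal, a center offset relative to the proposal's width and height and a log-space change of width and height. The snippet below only applies such predictions to proposals; the name `apply_box_deltas` and the `[x1, y1, x2, y2]` box format are my assumptions, and the training of the regressors is not shown.

```python
import numpy as np


def apply_box_deltas(proposals, deltas):
    """Refine proposals with predicted regression outputs.

    proposals: (N, 4) boxes as [x1, y1, x2, y2] (assumed format).
    deltas:    (N, 4) predictions as [dx, dy, dw, dh].
    """
    w = proposals[:, 2] - proposals[:, 0]
    h = proposals[:, 3] - proposals[:, 1]
    cx = proposals[:, 0] + 0.5 * w
    cy = proposals[:, 1] + 0.5 * h
    dx, dy, dw, dh = deltas.T
    # Shift the center proportionally to the proposal size ...
    new_cx = cx + w * dx
    new_cy = cy + h * dy
    # ... and rescale width and height in log-space.
    new_w = w * np.exp(dw)
    new_h = h * np.exp(dh)
    return np.stack(
        [
            new_cx - 0.5 * new_w,
            new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w,
            new_cy + 0.5 * new_h,
        ],
        axis=1,
    )
```

Parameterizing the shift relative to the proposal size and the scale in log-space makes the targets independent of the absolute size of the box, and all-zero predictions leave the proposal unchanged.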
The object detector is evaluated on PASCAL VOC and ILSVRC, where the approach is shown to outperform the state of the art. For details, refer to the paper. In addition, Girshick et al. provide an ablation study to investigate the design choices. These include a non-fine-tuned model as well as using different layers as features. Interestingly, the fully connected layers are very sensitive to fine-tuning while the last convolutional layer is not. They also provide a visualization in which individual units are interpreted as object detectors and the highest-confidence bounding boxes corresponding to these units are shown, as in Figure 1.
I can only recommend reading the paper including its experimental section as well as follow-up work [1, 2].