Song and Xiao propose to use 3D convolutional neural networks for 3D object detection in RGB-D images as provided by the NYU Depth Dataset v2  or the SUN RGBD dataset . The approach is splitted into an object recognition network jointly using 3D shape and 2D color features, and a region proposal network.
The region proposal network is applied to bounding boxes of varying shape and orientation sampled across the whole 3D scene and outputs an objectness score. It is also supposed to perform bounding box regression (i.e. predict the difference between the input bounding box size and the object bounding box size). To this end, the point cloud is voxelized using the directed Truncated Signed Distance Function. This means, that the space is divided into voxels and each voxel is represented by a vector encoding the shortest direction to the surface obtained from the RGB-D image used as input. The 3D scene is aligned with the direction of gravity and sampled with a grid size of $0.025$ meters resulting in a voxel grid of size $208 \times 208 \times 100$. The main directions of the scene are estimated using RANSAC (based on the Manhatten world assumption) and all objects are assumed to be aligned with these directions. At each anchor position of the sliding window based approach, the network predicts $19$ different scores corresponding to bounding boxes illustrated in Figure 1. Additionally, the network operates at different scales — for larger scales an additional pooling layer is used to increase the receptive field. Additionally, bounding box regression is performed by predicting the deviation from the predicted, fixed bounding box as in Figure 1. To this end, a multi-task loss is used where bounding box regression is trained using a smooth $L_1$ loss. The overall architecture is illustrated in Figure 2. For training, samples are labeled according to their 3D intersection over union score and each batch contains the positive and negative samples for a specific image. See the paper for the details.
Figure 1 (click to enlarge): List of the used anchor types, i.e. bounding box shapes. The subtitles of the boxes describe the 3D extends as well as the number of orientations used.
Figure 2 (click to enlarge): Illustration of the network architecture used for the region proposal network.
The object recognition network takes the proposals from the region proposal networks and divides the corresponding space into $30 \times 30 \times 30$ voxels (after padding). The voxel grid is used for classification based on the shape. However, Song and Xiao additionally use color information. Therefore, the points in the point cloud contained within the bounding box are backprojected to the image plane and the VGGnet  (pre-trained on ImageNet) is used to compute color features which are then fed into the overall object recognition network. This structure is illustrated in Figure 3.
Figure 3 (click to enlarge): Illustration of the network architecture used for the object recognition network. The network combines learned features from the 3D shape obtained from a $30 \times 30 \times 30$ voxel grid, and color features obtained from VGGnet .
The results look promising, outperforming their prior work called Sliding Shapes  by a significant margin. Furthermore, they discuss the region proposal network and show a significant performance boost in contrast to a 3D selective search algorithm (see the paper for details). Qualitative results, and a comparison with Sliding Shapes, can be found in Figure 4.