# DAVID STUTZ

15th MARCH 2017

S. Song, J. Xiao. Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images. CVPR, 2016.

Song and Xiao propose to use 3D convolutional neural networks for 3D object detection in RGB-D images as provided by the NYU Depth Dataset v2 [1] or the SUN RGB-D dataset [2]. The approach is split into a region proposal network and an object recognition network that jointly uses 3D shape and 2D color features.

The region proposal network is applied to bounding boxes of varying shape and orientation sampled across the whole 3D scene and outputs an objectness score for each of them. It also performs bounding box regression, i.e. it predicts the difference between the input bounding box size and the object bounding box size. To this end, the point cloud is voxelized using the directed Truncated Signed Distance Function (TSDF): the space is divided into voxels and each voxel is represented by a vector encoding the direction and distance to the closest point on the surface obtained from the input RGB-D image. The 3D scene is aligned with the direction of gravity and sampled with a grid size of $0.025$ meters, resulting in a voxel grid of size $208 \times 208 \times 100$ (i.e. a volume of $5.2 \times 5.2 \times 2.5$ meters). The main directions of the scene are estimated using RANSAC (based on the Manhattan world assumption) and all objects are assumed to be aligned with these directions.

At each anchor position of the sliding-window approach, the network predicts $19$ different scores corresponding to the bounding boxes illustrated in Figure 1. The network operates at different scales; for larger scales, an additional pooling layer is used to increase the receptive field. Bounding box regression is performed by predicting the deviation from the corresponding fixed bounding box in Figure 1.

The network is trained with a multi-task loss in which bounding box regression uses a smooth $L_1$ loss. The overall architecture is illustrated in Figure 2. For training, samples are labeled according to their 3D intersection over union score, and each batch contains the positive and negative samples of a specific image. See the paper for details.
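The two training ingredients mentioned above, labeling by 3D intersection over union and the smooth $L_1$ regression loss, can be sketched in a few lines of Python. This is only an illustration under several assumptions: boxes are treated as axis-aligned in the gravity-aligned scene frame, the thresholds `pos_thresh` and `neg_thresh` are placeholders (see the paper for the values actually used), and the function names are hypothetical rather than taken from the authors' code.

```python
from typing import Sequence, Tuple

# Axis-aligned box in the gravity-aligned scene frame (an assumption):
# (x_min, y_min, z_min, x_max, y_max, z_max), in meters.
Box = Tuple[float, float, float, float, float, float]


def volume(box: Box) -> float:
    """Volume of an axis-aligned box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a: Box, b: Box) -> float:
    """3D intersection over union of two axis-aligned boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (volume(a) + volume(b) - inter)


def label_anchor(anchor: Box, ground_truth: Sequence[Box],
                 pos_thresh: float, neg_thresh: float) -> int:
    """Return 1 (positive), 0 (negative) or -1 (ignored).

    The thresholds are hypothetical parameters, not values from the paper.
    """
    best = max((iou_3d(anchor, gt) for gt in ground_truth), default=0.0)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1


def smooth_l1(x: float) -> float:
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def regression_loss(predicted: Sequence[float],
                    target: Sequence[float]) -> float:
    """Sum of smooth L1 penalties over the box regression residuals."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted, target))


if __name__ == "__main__":
    anchor = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
    gt = (0.5, 0.0, 0.0, 1.5, 1.0, 1.0)
    print(iou_3d(anchor, gt))
    print(label_anchor(anchor, [gt], pos_thresh=0.3, neg_thresh=0.1))
    print(regression_loss([0.5, 2.0], [0.0, 0.0]))
```

In a multi-task setting, the regression term would typically only be evaluated for anchors labeled positive and added to the objectness classification loss; how the two terms are weighted is left open here.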

The object recognition network takes the proposals from the region proposal network and divides the corresponding space into $30 \times 30 \times 30$ voxels (after padding). This voxel grid is used for classification based on shape. Song and Xiao additionally use color information: the points of the point cloud contained within the bounding box are projected onto the image plane, and VGGNet [3] (pre-trained on ImageNet) is used to compute color features, which are then fed into the overall object recognition network. This structure is illustrated in Figure 3.
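The projection step can be illustrated with a simple pinhole camera model. The sketch below is an assumption-laden example, not the authors' implementation: the intrinsics `fx`, `fy`, `cx`, `cy` and the function names are hypothetical, points are assumed to be given in camera coordinates with $z$ pointing forward, and taking the enclosing 2D box of the projected points is one plausible way to select the image region passed to the 2D network.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_points(points: Sequence[Point3D],
                   fx: float, fy: float,
                   cx: float, cy: float) -> List[Pixel]:
    """Project 3D camera-frame points to pixel coordinates (pinhole model).

    Points at or behind the camera plane (z <= 0) are skipped.
    """
    pixels = []
    for x, y, z in points:
        if z <= 0.0:
            continue
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


def enclosing_box_2d(pixels: Sequence[Pixel]) -> Tuple[float, float, float, float]:
    """Smallest axis-aligned 2D box (u_min, v_min, u_max, v_max) around the pixels."""
    if not pixels:
        raise ValueError("no projected points")
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))


if __name__ == "__main__":
    # Hypothetical intrinsics and points inside a 3D proposal.
    points_in_box = [(0.0, 0.0, 1.0), (1.0, 1.0, 2.0), (0.2, -0.1, 1.5)]
    pixels = project_points(points_in_box, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    print(enclosing_box_2d(pixels))
```

The resulting 2D box would then be cropped from the color image (and clipped to the image bounds) before being passed to the pre-trained network.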

The results look promising: the approach outperforms the authors' prior work, Sliding Shapes [4], by a significant margin. Furthermore, they discuss the region proposal network and show a significant performance boost compared to a 3D selective search algorithm (see the paper for details). Qualitative results and a comparison with Sliding Shapes can be found in Figure 4.

• [1] N. Silberman, D. Hoiem, P. Kohli, R. Fergus. Indoor segmentation and support inference from RGBD images. ECCV, 2012.
• [2] S. Song, S. Lichtenberg, J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. CVPR, 2015.
• [3] K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
• [4] S. Song, J. Xiao. Sliding Shapes for 3D object detection in depth images. ECCV, 2014.