M. Engelcke, D. Rao, D. Zeng Wang, C. H. Tong, I. Posner. Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks. CoRR, 2016.

Engelcke et al. use sparse 3D convolutional neural networks for 3D object detection on the KITTI [1] benchmark. Following earlier work [2], they use sparse convolutions to implement 3D convolutions on sparse 3D data. To this end, they convert the input point cloud into an occupancy grid where each grid cell holds statistics about the underlying points. As this occupancy grid is very sparse, performing regular 3D convolutions is computationally prohibitive. Instead of evaluating the kernel at every location in the grid, they flip the kernel and lay it over every non-zero voxel so that these voxels can "cast votes" for their neighbors. This scheme is illustrated in Figure 1 and greatly reduces the computational effort needed for 3D convolutions.
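The point-cloud-to-grid conversion could be sketched as follows; the resolution, grid shape, and the per-cell statistic (a simple point count here) are illustrative placeholders, as the paper uses richer per-cell statistics:

```python
import numpy as np

def occupancy_grid(points, resolution=0.2, grid_shape=(100, 100, 10)):
    """Convert an (N, 3) point cloud into a sparse occupancy grid.

    Each occupied cell stores a simple statistic (here: the point
    count). `resolution` and `grid_shape` are illustrative values,
    not taken from the paper.
    """
    # Map each point to integer voxel indices.
    indices = np.floor(points / resolution).astype(int)
    # Keep only points that fall inside the grid bounds.
    valid = np.all((indices >= 0) & (indices < np.array(grid_shape)), axis=1)
    indices = indices[valid]
    # Store the grid sparsely as {voxel index tuple: count},
    # so empty cells cost no memory.
    grid = {}
    for idx in map(tuple, indices):
        grid[idx] = grid.get(idx, 0) + 1
    return grid
```

Storing only occupied cells is what makes the subsequent voting scheme pay off: all computation is driven by this sparse set rather than by the full dense grid.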

Figure 1: Illustration of the voting scheme used to efficiently compute convolutions in sparse data. The example shows the voting scheme applied to sparse 2D grids. Instead of applying the kernel (center left) to every position in the grid, resulting in many zero multiplications, the kernel is flipped (center right) and applied to every non-zero position in the grid (indicated by the two green rectangles, right). The kernel is then used to cast votes regarding the new values of neighboring voxels.
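The voting scheme from Figure 1 can be sketched for 2D grids as follows; the dictionary-based sparse representation is an illustrative choice, not the paper's actual implementation:

```python
import numpy as np

def vote_conv2d(sparse_input, kernel):
    """Sparse 2D convolution via voting.

    sparse_input: dict {(row, col): value} holding only non-zero cells.
    kernel: (kh, kw) array with odd dimensions.

    Each non-zero input cell casts votes into its neighborhood using
    the flipped kernel; accumulating all votes yields the same result
    as a dense (cross-correlation style) convolution, but the cost
    scales with the number of occupied cells, not the grid size.
    """
    flipped = kernel[::-1, ::-1]  # flip the kernel as in Figure 1
    kh, kw = kernel.shape
    votes = {}
    for (r, c), v in sparse_input.items():
        for dr in range(kh):
            for dc in range(kw):
                # Output position relative to the kernel center.
                out = (r + dr - kh // 2, c + dc - kw // 2)
                votes[out] = votes.get(out, 0.0) + v * flipped[dr, dc]
    return votes
```

For a grid with a single occupied cell, the votes reproduce the flipped kernel stamped around that cell, which is exactly the behavior the figure illustrates.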

For object detection on KITTI, they use a fixed-size bounding box for each category (e.g., pedestrian, vehicle, cyclist). For each category, a binary classifier is used, represented by a comparatively shallow 3D convolutional network as illustrated in Figure 2. Each sparse convolutional layer is followed by rectified linear units in order to preserve sparsity; furthermore, the biases used in the convolutional layers are constrained to be negative. Training uses the hinge loss, together with weight decay and an $L_1$ regularizer for sparsity. The model is trained on an augmented set of positive and negative examples obtained by randomly rotating and translating them. Every now and then, hard negatives are mined and added to the training set.

Figure 2: Summary of the evaluated models. Different models are used for different classes; overall, however, the models are comparatively shallow.
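The training objective described above could look roughly as follows. The coefficient values are placeholders, and applying the $L_1$ penalty to the intermediate activations (rather than the weights) is my assumption, chosen because it directly encourages the sparsity the network relies on:

```python
import numpy as np

def detection_loss(scores, labels, weights, activations,
                   weight_decay=1e-4, l1_sparsity=1e-4):
    """Hinge loss for a per-class binary detector, plus weight decay
    (L2 on the parameters) and an L1 penalty for sparsity.

    scores: (N,) raw detector scores.
    labels: (N,) ground-truth labels in {-1, +1}.
    weights: list of weight arrays of the network.
    activations: list of intermediate activation arrays.

    Coefficients are illustrative; the L1-on-activations choice is an
    assumption, as the summary only states "an L1 regularizer for
    sparsity".
    """
    # Standard hinge loss: zero once the margin exceeds 1.
    hinge = np.maximum(0.0, 1.0 - labels * scores).mean()
    # Weight decay (L2) over all parameters.
    l2 = weight_decay * sum(np.sum(w ** 2) for w in weights)
    # L1 penalty encouraging sparse activations.
    l1 = l1_sparsity * sum(np.sum(np.abs(a)) for a in activations)
    return hinge + l2 + l1
```

Combined with ReLUs and negative biases, such a penalty keeps intermediate feature maps sparse, so the voting scheme remains cheap throughout the network, not just at the input layer.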

The performance is compared to other state-of-the-art methods, including their earlier work [2], on the KITTI test set, demonstrating significantly improved accuracy.

  • [1] A. Geiger, P. Lenz, R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. CVPR, 2012.
  • [2] D. Z. Wang, I. Posner. Voting for Voting in Online Point Cloud Object Detection. Robotics: Science and Systems, 2015.