# DAVID STUTZ

JANUARY 2017

Daniel Maturana, Sebastian Scherer. VoxNet: A 3D Convolutional Neural Network for real-time object recognition. IROS, 2015.

Maturana and Scherer, building partly on their earlier work on landing zone detection from LiDAR (Maturana and Scherer, ICRA 2015), present VoxNet, a 3D convolutional neural network for object and shape recognition. While the model is a straightforward generalization of 2D convolutional neural networks to the 3D domain of CAD, LiDAR and RGBD data, the paper offers an excellent introduction and baseline for 3D object recognition from a deep learning perspective. For example, the authors discuss three different occupancy grid models for representing the data, address rotational invariance around the z-axis, and provide an evaluation and comparison to 3D ShapeNets (Wu et al.).

To compute the occupancy grids, they use 3D ray tracing to determine which voxels each measurement hits or passes through, and present the following representations:

• Binary occupancy grids are based on the discussion by Thrun (2003), where the occupancy $l_{ijk}$ of the voxel at position $(i,j,k)$ is modeled probabilistically given the sensor measurements $z^1,\ldots,z^t$ as $p(l_{ijk} | z^1,\ldots,z^t)$. The estimate is maintained in log-odds form, and the update equation for $l_{ijk}^t$ given measurement $t$ is

$l_{ijk}^t = l_{ijk}^{t - 1} + z^t l_{occ} + (1-z^t)l_{free}$

with $z^t = 1$ if the voxel is hit and $z^t = 0$ if the measurement passes through the voxel. The constants $l_{occ}$ and $l_{free}$ are set to $1.38$ and $-1.38$, respectively.
• Density grids assign each voxel a continuous density, as detailed by Hähnel et al., using the update equations

$\alpha_{ijk}^t = \alpha_{ijk}^{t-1} + z^t$

$\beta_{ijk}^t = \beta_{ijk}^{t - 1} + (1 - z^t)$

and the occupancy estimate

$\mu_{ijk}^t = \frac{\alpha_{ijk}^t}{\alpha_{ijk}^t + \beta_{ijk}^t}$

• The hit grid only records hits: the value of a voxel is incremented for each hit and capped at $1$, i.e., $h_{ijk}^t = \min(h_{ijk}^{t-1} + z^t, 1)$. This model discards the difference between free space and unobserved space, but Maturana and Scherer report surprisingly good results using this occupancy grid model.
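To make the three update rules concrete, here is a minimal per-voxel sketch in plain Python. The function names, the initial values ($l^0 = 0$, $\alpha^0 = \beta^0 = 1$) and the reading of the hit grid as a hit count capped at $1$ are assumptions made for illustration; only the update equations and the constants $\pm 1.38$ are taken from the text above.

```python
# Per-voxel occupancy updates; z is 1 for a hit and 0 for a pass-through.
L_OCC, L_FREE = 1.38, -1.38  # constants quoted in the text


def update_binary(l_prev, z):
    """Binary occupancy grid: additive log-odds update."""
    return l_prev + z * L_OCC + (1 - z) * L_FREE


def update_density(alpha_prev, beta_prev, z):
    """Density grid: update the hit/miss counters and return (alpha, beta, mu)."""
    alpha = alpha_prev + z
    beta = beta_prev + (1 - z)
    return alpha, beta, alpha / (alpha + beta)


def update_hit(h_prev, z):
    """Hit grid (assumed form): count hits, capped at 1."""
    return min(h_prev + z, 1)


def run_voxel(measurements, l0=0.0, alpha0=1.0, beta0=1.0, h0=0):
    """Apply all three models to one voxel's measurement sequence.

    The initial values are assumptions for this sketch.
    """
    l, alpha, beta, h = l0, alpha0, beta0, h0
    mu = alpha / (alpha + beta)
    for z in measurements:
        l = update_binary(l, z)
        alpha, beta, mu = update_density(alpha, beta, z)
        h = update_hit(h, z)
    return l, mu, h
```

For a voxel that is hit twice and passed through once, `run_voxel([1, 1, 0])` shows how the models differ: the log odds move up and down symmetrically, the density estimate stays between $0$ and $1$, and the hit grid saturates after the first hit.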

Given an occupancy grid as input, they use a 3D convolutional neural network consisting of two convolutional layers, a pooling layer and two fully connected layers. The architecture is summarized in Figure 1.

Figure 1: The architecture used, consisting of two convolutional layers starting at an input resolution of $32^3$, followed by a $2^3$ pooling layer and two fully connected layers.
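As a rough illustration of how the resolution shrinks through such a network, the following sketch traces the edge length of the feature volume layer by layer. The kernel sizes, strides and filter counts are assumptions chosen for illustration (they are not listed in these notes); only the $32^3$ input and the $2^3$ pooling come from the description above.

```python
def out_size(size, kernel, stride=1):
    """Edge length after an unpadded convolution or pooling along one axis."""
    return (size - kernel) // stride + 1


# (name, kernel, stride, channels); hypothetical hyperparameters.
LAYERS = [
    ("conv1", 5, 2, 32),
    ("conv2", 3, 1, 32),
    ("pool", 2, 2, 32),
]


def trace_shapes(input_size=32, layers=LAYERS):
    """Return (name, channels, edge length) after each layer."""
    shapes = []
    size = input_size
    for name, kernel, stride, channels in layers:
        size = out_size(size, kernel, stride)
        shapes.append((name, channels, size))
    return shapes


def flattened_features(shapes):
    """Number of inputs seen by the first fully connected layer."""
    _, channels, size = shapes[-1]
    return channels * size ** 3
```

Under these assumed settings the volume goes from $32^3$ to $14^3$, $12^3$ and finally $6^3$, so the first fully connected layer would see $32 \cdot 6^3 = 6912$ features. Different kernel sizes or strides would change these numbers.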

The training procedure addresses the problem of rotation invariance. The motivation is that it is not trivial to maintain a consistent object orientation around the z-axis, even though the z-axis itself is assumed to be aligned with the direction of gravity. Therefore, the training set is augmented with several copies of each model rotated around the z-axis. Experiments show that this approach works well. At test time, the input is rotated in the same way and the predictions over all rotations are pooled, resulting in a scheme similar to voting. Additional randomly perturbed and mirrored copies further help training.
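The rotation scheme can be sketched as follows, here operating on a list of 3D points before voxelization. This is only an illustration under stated assumptions: `predict` is a hypothetical callable returning one score per class, the rotations are evenly spaced, and the pooling is done by averaging the scores, which is one of several possible choices.

```python
import math


def rotate_z(points, angle):
    """Rotate (x, y, z) points around the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def rotated_copies(points, n_rotations):
    """Evenly spaced rotations around the z-axis, used for augmentation."""
    step = 2 * math.pi / n_rotations
    return [rotate_z(points, i * step) for i in range(n_rotations)]


def pooled_prediction(predict, points, n_rotations):
    """Average the class scores over all rotated copies.

    `predict` is a hypothetical classifier: points -> list of class scores.
    Returns (predicted class index, mean scores).
    """
    all_scores = [predict(copy) for copy in rotated_copies(points, n_rotations)]
    n_classes = len(all_scores[0])
    mean = [
        sum(scores[k] for scores in all_scores) / len(all_scores)
        for k in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda k: mean[k])
    return best, mean
```

During training, `rotated_copies` would be applied to every model in the training set; at test time, `pooled_prediction` plays the role of the voting step.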

Experiments are presented on three datasets: a LiDAR dataset, the ModelNet dataset (Wu et al.) and the NYU Depth Dataset (Silberman et al.). The approach is compared to the 3D ShapeNets of Wu et al. and shown to outperform them with considerably fewer parameters on all tasks except the cross-domain task, where the model is trained on a different dataset than the one it is evaluated on.

• D. Maturana, S. Scherer. 3D convolutional neural networks for landing zone detection from LiDAR. ICRA, 2015.
• Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, J. Xiao. 3D ShapeNets: A deep representation for volumetric shape modeling. CVPR, 2015.
• S. Thrun. Learning occupancy grid maps with forward sensor models. Auton. Robots, 2003.
• D. Hähnel, D. Schulz, W. Burgard. Map building with mobile robots in populated environments. IROS, 2002.
• N. Silberman, D. Hoiem, P. Kohli, R. Fergus. Indoor segmentation and support inference from RGBD images. ECCV, 2012.