Daniel Maturana, Sebastian Scherer. VoxNet: A 3D Convolutional Neural Network for real-time object recognition. IROS, 2015.

Maturana and Scherer, building partly on their work in [1], present VoxNet, a 3D convolutional neural network for object/shape recognition. While the presented model is a simple generalization of 2D convolutional neural networks to the 3-dimensional domain of CAD, LiDAR and RGBD data, the paper provides an excellent introduction and baseline for 3D object recognition from a deep learning perspective. For example, they discuss three different occupancy grid models for representing the data, address rotational invariance around the z-axis, and provide an evaluation and comparison with 3D ShapeNets [2].

To build the occupancy grids, they use 3D ray tracing and present the following representations:

  • Binary occupancy grids are based on the discussion by Thrun in [3], where the occupancy of a voxel $l_{ijk}$ is modeled probabilistically given the sensor measurements $z^1,\ldots,z^t$ as $p(l_{ijk} | z^1,\ldots,z^t)$. Written in log-odds form, the update equation for $l_{ijk}^t$ for measurement $t$ is then given by

    $l_{ijk}^t = l_{ijk}^{t - 1} + z^t l_{occ} + (1-z^t)l_{free}$

    with $z^t = 1$ if the voxel is hit and $z^t = 0$ if the measurement passes through the voxel. The constants $l_{occ}$ and $l_{free}$ are set to $1.38$ and $-1.38$, respectively.
  • Density grids assign each voxel a continuous density, as detailed in [4], using the update equations

    $\alpha_{ijk}^t = \alpha_{ijk}^{t-1} + z^t$

    $\beta_{ijk}^t = \beta_{ijk}^{t - 1} + (1 - z^t)$

    and the occupancy estimate

    $\mu_{ijk}^t = \frac{\alpha_{ijk}^t}{\alpha_{ijk}^t + \beta_{ijk}^t}$

  • The hit grid only records hits, with the value of each voxel capped at $1$. This model discards the difference between free space and unobserved space, but Maturana and Scherer report surprisingly good results using this occupancy grid model.
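
The three update rules can be compared side by side on a single voxel. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the initial values ($l^0 = 0$, $\alpha^0 = \beta^0 = 1$, $h^0 = 0$) are assumptions not stated in this summary, and the hit grid is written as a counter capped at $1$, which is likewise an assumption about how the hit value is bounded.

```python
L_OCC, L_FREE = 1.38, -1.38  # log-odds increments quoted in the text


def update_binary(l_prev, z):
    """Log-odds update: add L_OCC on a hit (z = 1), L_FREE on a pass-through (z = 0)."""
    return l_prev + z * L_OCC + (1 - z) * L_FREE


def update_density(alpha_prev, beta_prev, z):
    """Accumulate hit count (alpha) and pass-through count (beta)."""
    return alpha_prev + z, beta_prev + (1 - z)


def density_estimate(alpha, beta):
    """Occupancy estimate mu = alpha / (alpha + beta)."""
    return alpha / (alpha + beta)


def update_hit(h_prev, z):
    """Hit indicator, capped at 1 (ASSUMPTION: cap, see lead-in)."""
    return min(h_prev + z, 1)


def integrate(measurements, l0=0.0, alpha0=1.0, beta0=1.0, h0=0):
    """Apply all three updates to one voxel for a sequence of z values.

    The initial values are ASSUMED defaults, not taken from the summary.
    """
    l, alpha, beta, h = l0, alpha0, beta0, h0
    for z in measurements:
        l = update_binary(l, z)
        alpha, beta = update_density(alpha, beta, z)
        h = update_hit(h, z)
    return l, alpha, beta, h


if __name__ == "__main__":
    # Two hits followed by one ray passing through the voxel.
    l, alpha, beta, h = integrate([1, 1, 0])
    print(l, density_estimate(alpha, beta), h)
```

Note that only voxels actually touched by a ray receive an update; unobserved voxels keep their initial value, which is why the hit grid cannot tell free space from unobserved space.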

Taking an occupancy grid as input to the 3D convolutional neural network, they use an architecture consisting of two convolutional layers, a pooling layer and two fully connected layers. The architecture is summarized in Figure 1.


Figure 1: The architecture used, consisting of two convolutional layers, starting from an input resolution of $32^3$, followed by a $2^3$ pooling layer and two fully connected layers.
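
To get a feeling for the size of such a network, the layer shapes can be worked out by hand. The sketch below does this arithmetic under assumed hyperparameters that are not given in this summary: 32 filters per convolution, a first convolution with kernel 5 and stride 2, a second with kernel 3 and stride 1, no padding, 128 hidden units and 40 output classes. Only the $32^3$ input and the $2^3$ pooling come from the figure; all function names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output side length of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def voxnet_like_shapes(side=32, num_classes=40):
    """Side lengths and parameter counts for a two-conv, one-pool, two-FC net.

    ASSUMED (not from the summary): filters, kernel sizes, strides, hidden
    units and number of classes.
    """
    filters, hidden = 32, 128
    s1 = conv_out(side, kernel=5, stride=2)  # first convolution
    s2 = conv_out(s1, kernel=3, stride=1)    # second convolution
    s3 = conv_out(s2, kernel=2, stride=2)    # 2^3 pooling
    flat = filters * s3 ** 3                 # inputs to the first FC layer

    params = {
        "conv1": 5 ** 3 * 1 * filters + filters,
        "conv2": 3 ** 3 * filters * filters + filters,
        "fc1": flat * hidden + hidden,
        "fc2": hidden * num_classes + num_classes,
    }
    return {"conv1": s1, "conv2": s2, "pool": s3, "flat": flat}, params


if __name__ == "__main__":
    shapes, params = voxnet_like_shapes()
    print(shapes)
    print(sum(params.values()))
```

Under these assumptions the first fully connected layer dominates the parameter count, and the total stays below one million.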

The training procedure addresses the problem of rotation invariance. The motivation is that it is not trivial to maintain a consistent object orientation around the z-axis, while the z-axis itself is assumed to be aligned with the direction of gravity. Therefore, the training set is augmented with several copies of each model rotated around the z-axis. Experiments show that this approach works well. At test time, the input is rotated in the same way and the predictions over all rotated copies are pooled, resulting in a scheme similar to voting. Additional randomly perturbed and mirrored copies further help training.
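
The rotation scheme can be sketched on a raw point set before voxelization. This is only an illustration under stated assumptions: the helper names are hypothetical, `predict` stands in for any classifier returning per-class scores, the rotation steps are assumed to be equally spaced, and averaging is just one possible way to pool the predictions.

```python
import math


def rotate_z(points, theta):
    """Rotate (x, y, z) points by theta radians around the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def rotated_copies(points, n):
    """Return n copies rotated in equal steps of 2*pi/n around z."""
    return [rotate_z(points, 2.0 * math.pi * i / n) for i in range(n)]


def pooled_prediction(points, predict, n):
    """Average per-class scores over n rotated copies.

    `predict` is a hypothetical callable mapping a point list to a list of
    class scores. Returns (index of best class, averaged scores).
    """
    scores = [predict(copy) for copy in rotated_copies(points, n)]
    num_classes = len(scores[0])
    mean = [sum(s[c] for s in scores) / n for c in range(num_classes)]
    best = max(range(num_classes), key=mean.__getitem__)
    return best, mean


if __name__ == "__main__":
    # Toy classifier: votes for class 0 only if the first point has x > 0.5.
    def toy_predict(pts):
        return [1.0, 0.0] if pts[0][0] > 0.5 else [0.0, 1.0]

    print(pooled_prediction([(1.0, 0.0, 0.0)], toy_predict, n=4))
```

For training, the same `rotated_copies` helper would be used to enlarge the training set, so that the network sees each object in several orientations.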

Experiments are presented on three datasets: a LiDAR dataset, the ModelNet dataset [2] and the NYU Depth Dataset [5]. The approach is compared to the 3D ShapeNets of Wu et al. [2] and shown to outperform 3D ShapeNets with considerably fewer parameters on all tasks except the cross-domain task, where the model is trained on a different dataset than the one it is evaluated on.

  • [1] D. Maturana, S. Scherer. 3D convolutional neural networks for landing zone detection from lidar. ICRA, 2015.
  • [2] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, J. Xiao. 3D ShapeNets: A deep representation for volumetric shape modeling. CVPR, 2015.
  • [3] S. Thrun. Learning occupancy grid maps with forward sensor models. Auton. Robots, 2003.
  • [4] D. Hähnel, D. Schulz, W. Burgard. Map building with mobile robots in populated environments. IROS, 2002.
  • [5] N. Silberman, D. Hoiem, P. Kohli, R. Fergus. Indoor segmentation and support inference from RGBD images. ECCV, 2012.
