Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, Leonidas Guibas. Volumetric and Multi-View CNNs for Object Classification on 3D Data. CVPR, 2016.

Qi et al. study how to improve both volumetric convolutional neural networks (CNNs) and multi-view CNNs for 3D shape recognition. In particular, they study the performance gap between these approaches, i.e. volumetric CNNs usually demonstrate inferior performance compared to multi-view CNNs:

"Intuitively, a volumetric representation should encode as much information, if not more, than its multi-view counterpart. However, experiments indicate that multiview CNNs produce superior performance in object classification."

According to Qi et al., this is due to two factors:

  • Lower resolution used in volumetric CNNs - usually $30 \times 30 \times 30$ vs. $227 \times 227$ for multi-view CNNs.
  • Network architecture differences.

The second argument is motivated by the observation that multi-view CNNs still perform significantly better than volumetric CNNs even at a low resolution such as $30 \times 30$. Besides an extensive evaluation, Qi et al. make the following contributions:

  • Introducing auxiliary learning tasks to prevent overfitting of volumetric CNNs.
  • Using orientation pooling and data augmentation for volumetric CNNs.
  • Introducing anisotropic kernels to probe the volume; several layers with anisotropic kernels implicitly generate multiple "learned" projections of the volume.
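
To make the idea of orientation pooling more concrete, here is a minimal sketch in plain Python. It assumes that pooling means an element-wise maximum over the features of several rotated copies of the volume; the paper summary above does not confirm the exact operator. The names `rotate90`, `orientations`, `toy_features` and `orientation_pool` are hypothetical, and `toy_features` is only a stand-in for the shared CNN.

```python
def rotate90(volume):
    """Rotate a [z][y][x] voxel grid by 90 degrees about the z axis."""
    return [[list(row) for row in zip(*layer[::-1])] for layer in volume]


def orientations(volume, count=4):
    """Return the volume together with its successive 90-degree rotations."""
    views = [volume]
    for _ in range(count - 1):
        views.append(rotate90(views[-1]))
    return views


def toy_features(volume):
    """Hypothetical stand-in for the shared CNN: occupancy in two half-spaces."""
    half_x = len(volume[0][0]) // 2
    half_y = len(volume[0]) // 2
    left = sum(v for layer in volume for row in layer for v in row[:half_x])
    front = sum(v for layer in volume for row in layer[:half_y] for v in row)
    return [left, front]


def orientation_pool(volume, feature_fn=toy_features):
    """Element-wise max over per-orientation features (assumed pooling operator)."""
    features = [feature_fn(view) for view in orientations(volume)]
    return [max(column) for column in zip(*features)]
```

Because the four rotations form a closed set, the pooled features in this sketch do not change when the input itself is rotated by 90 degrees, even though the individual per-orientation features do.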

They also conduct experiments on the influence of resolution on performance; however, they only compare $10 \times 10 \times 10$ and $30 \times 30 \times 30$. As discussed in [1], this is still relatively coarse.
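
For a rough sense of scale, the following snippet counts the input elements for the resolutions mentioned in this summary. It only does arithmetic on the sizes quoted above; the dictionary keys and the helper `num_elements` are hypothetical names chosen for illustration.

```python
import math


def num_elements(shape):
    """Number of voxels or pixels in an input of the given shape."""
    return math.prod(shape)


counts = {
    "volumetric 10x10x10": num_elements((10, 10, 10)),
    "volumetric 30x30x30": num_elements((30, 30, 30)),
    "single view 30x30": num_elements((30, 30)),
    "single view 227x227": num_elements((227, 227)),
}

for name, count in counts.items():
    print(f"{name}: {count} elements")
```

A single $227 \times 227$ view holds 51529 pixels, roughly twice the 27000 voxels of a $30 \times 30 \times 30$ grid, and a multi-view CNN processes several such views per shape.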

Qi et al. introduce two new network architectures which are both categorized as volumetric CNNs. First, after subsampling the volume, auxiliary tasks to predict the class based on a subvolume are defined. This is done by slicing the obtained volume after the last subsampling/pooling step. Each of these auxiliary tasks tries to predict the class based on roughly $\frac{2}{3}$ of the original volume. As these tasks are considerably more difficult than the original task, this prevents the network from overfitting. The model is illustrated in Figure 1.
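
The auxiliary tasks can be sketched as follows, under some assumptions: the pooled feature volume is sliced into one channel vector per spatial location, each vector is classified by its own head, and the auxiliary cross-entropy losses are simply added to the main loss with a weight `aux_weight`. The function names and the weighting are hypothetical and not taken from the paper; the classifier heads themselves are left out, so the functions operate directly on logits.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def slice_feature_volume(features):
    """Split a [channel][z][y][x] feature volume into one channel vector
    per spatial location; each vector feeds one auxiliary classifier."""
    channels = len(features)
    depth = len(features[0])
    height = len(features[0][0])
    width = len(features[0][0][0])
    return [
        [features[c][z][y][x] for c in range(channels)]
        for z in range(depth)
        for y in range(height)
        for x in range(width)
    ]


def combined_loss(main_logits, aux_logits_list, label, aux_weight=1.0):
    """Main classification loss plus weighted auxiliary losses (assumed form)."""
    main = cross_entropy(main_logits, label)
    aux = sum(cross_entropy(logits, label) for logits in aux_logits_list)
    return main + aux_weight * aux
```

Since every auxiliary head must predict the full object class from a partial view of the shape, the shared layers receive additional training signal beyond the whole-object prediction.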


Figure 1: Illustration of the network architecture after introducing the auxiliary tasks. Mlpconv corresponds to a network-in-network layer as described in [2]. As can be seen, after subsampling the volume, it is sliced and fed into separate fully-connected networks for prediction.

Second, large anisotropic kernels are used to project the volume onto a single image. This way, instead of using multi-view CNNs where the individual views are pre-computed based on fixed angles, the CNN is able to learn the most appropriate projection for the task. The architecture is illustrated in Figure 2. After the projection, the image is classified using the model described in [2]. Both presented architectures are trained using orientation pooling, aggregating information from different orientations.
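
One possible reading of anisotropic probing is sketched below in plain Python: a kernel with $1 \times 1$ spatial extent that spans the full depth of the volume collapses it into a single image. This is an assumption for illustration only. The actual network stacks several such layers, presumably with non-linearities in between, and the function name `probe_along_depth` is hypothetical.

```python
def probe_along_depth(volume, kernel):
    """Apply a (depth x 1 x 1) kernel to a [z][y][x] voxel grid.

    Each output pixel is a weighted sum of the voxels along one ray,
    so the result is a [y][x] image, i.e. a projection of the volume.
    """
    depth = len(volume)
    if len(kernel) != depth:
        raise ValueError("kernel must span the full depth of the volume")
    height = len(volume[0])
    width = len(volume[0][0])
    return [
        [
            sum(kernel[z] * volume[z][y][x] for z in range(depth))
            for x in range(width)
        ]
        for y in range(height)
    ]
```

With all kernel weights set to one, this reduces to a fixed X-ray-like projection that counts occupied voxels along each ray; learning the weights instead lets the network choose how to weight different depths.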


Figure 2: Network architecture based on anisotropic probing to learn appropriate projections.

Unfortunately, key details regarding the anisotropic filters and the orientation pooling are omitted. In particular, it is unclear how the anisotropic filters differ from regular filtering in volumes (except for the $1 \times 1$ spatial extent of the filters).

  • [1] Yangyan Li, Sören Pirk, Hao Su, Charles Ruizhongtai Qi, Leonidas J. Guibas. FPNN: Field Probing Neural Networks for 3D Data. CoRR, abs/1605.06240.
  • [2] Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. CoRR, abs/1312.4400.