COMPUTERVISION RESEARCHSCIENTIST

RESEARCHSCIENTIST

18^{th}MARCH2017

A. Brock, T. Lim, J. M. Ritchie, N. Weston. *Generative and Discriminative Voxel Modeling with Convolutional Neural Networks*. CoRR, 2016.

What is **your opinion** on the summarized work? Or do you know related work that is of interest? **Let me know** your thoughts in the comments below or get in touch with me:

Brock et al. discuss voxel-based variational auto-encoders for shape reconstruction as well as deep 3D convolutional neural networks for shape classification. Regarding shape reconstruction, the variational auto-encoder architecture is illustrated in Figure 1 including the corresponding filter sizes. They make use of exponential linear units [1] and Xavier initialization [2]. Instead of pooling, they use strided convolution for downsampling and fractionally strided convolution [3] for upsampling. The loss function is augmented to better fit the task of reconstructing very sparse 3D occupancy grids. To this end, they augment the binary cross entropy per pixel to weight false positives and false negatives differently:

$\mathcal{L} = - \gamma t \log(o) - (1 - \gamma)(1 - t) \log(1 - o)$

where $t$ is the target, $o$ the output and the weight $\gamma$ is set to $0.97$. Furthermore, the target $t$ is re-scaled to lie in $\{-1,2\}$ and the output is re-scaled to $[0.1,1)$. The intention is to avoid vanishing gradients during training. Still, these numbers look rather random — no theoretical or experimental validation is provided. The model is trained on augmented data including horizontal flips, random translations and noise. In experiments, they validate that the modified binary cross entropy aids training. Reconstruction and interpolation results are shown in Figure 2.

Figure 1 (

click to enlarge): Architecture of the used variational auto-encoder for 3D shape reconstruction.Figure 2 (

click to enlarge): Reconstruction and interpolation results.For 3D shape classification, Brock et al. use deep 3D convolutional networks combining many interesting techniques with the newly proposed Voxeption blocks. The overall architecture is shown in Figure 3 and discussed in the following. DS describes a Voxception Downsampling block as illustrated in Figure 4 (left). The idea is to let the network decide which downsampling approach is most useful for the task. Thus, the block concatenates downsampled versions of the input feature maps where downsampling is performed using max pooling, average pooling or different convolutional layers with stride. VRN describes a Voxception ResNet block, illustrated in Figure 4 (right). The intention is to give the network the possibility to choose between different convolutional layers, in this case $1 \times 1 \times 1$ vs. $3 \times 3 \times 3$. This approach is combined with the design of residual units [4]. In the spirit of [5], they drop the non-residual connections of the network with varying probability, see the paper for details. The overall network architecture consists of four blocks all containing a VRN and a DS block, followed by a final residual convolutional layer, a pooling layer and a fully connected layer.

For training, Brock et al. change the binary voxels to lie in $\{-1,5\}$ to encourage better training and adapt the learning rate manually based on the validation loss. The training set is augmented using random, horizontal flipping, translation and incorporating different rotations of each sample. The model is initialized on a training set augmented with 12 rotations and fine-tuned on 24 orientations per sample. For testing, they use a small ensemble, which significantly boosts performance and outperforms the compared state-of-the-art.

Figure 3 (

click to enlarge): Illustration of the overall architecture used for 3D shape classification. The architecture mainly consists of 4 blocks comprising so-called Voxceptiom ResNet blocks (VRN) and Voxception Downsampling blocks (DS), see the text for details.Figure 4: Illustration of the Voxception Downsampling (DS) block and the Voxception ResNet (VRN) block.

Fast and accurate deep network learning by exponential linear units (elus). ICLR, 2016.Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.A guide to convolution arithmetic for deep learning. CoRR, 2016.Deep residual learning for image recognition. CVPR, 2016.Deep networks with stochastic depth. CoRR, 2016.