27th September 2017

Gernot Riegler, Ali Osman Ulusoy, Andreas Geiger. *OctNet: Learning Deep 3D Representations at High Resolutions*. CoRR, 2016.

What is **your opinion** on the summarized work? Or do you know related work that is of interest? **Let me know** your thoughts in the comments below:

Riegler et al. present OctNet, a network architecture that allows training deep networks on sparse 3D data at high resolution. The approach is based on the simple observation that 3D data is usually sparse, for example when considering point clouds or 3D shapes. This implies that, at high resolutions, most voxels carry no information. The idea of Riegler et al. is to automatically adapt the resolution according to the information present in the 3D data. Projected onto the image plane, this idea is illustrated in Figure 1.

Figure 1: Illustration of the adaptive resolution (bottom) versus fixed resolution (top).

In practice, Riegler et al. utilize shallow octrees with maximum depth $3$. A 3D tensor of size $H\times W\times D$ is then represented by a grid of $\frac{H}{8}\times\frac{W}{8}\times\frac{D}{8}$ such octrees, where each octree covers up to $8^3 = 512$ voxels of the original resolution. For example, in 3D shape recognition, where the 3D tensor contains only $0$s and $1$s, the highest resolution is only needed along the surface of the shape, while coarser resolutions suffice inside and outside the shape, as illustrated in Figure 1. Riegler et al. discuss the details of implementing this efficiently, i.e. such that each cell can be accessed directly.
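The grid-of-shallow-octrees idea can be sketched as follows. This is a minimal, hypothetical illustration in NumPy (not OctNet's actual pointer-free bit-string layout): the voxel grid is split into $8^3$ blocks, and any block that is homogeneous collapses to a single stored value, while mixed blocks keep full resolution.

```python
import numpy as np

def to_shallow_octree_blocks(voxels):
    """Split a binary H x W x D voxel grid into 8x8x8 blocks, the unit
    covered by one shallow octree of maximum depth 3 (2^3 = 8 per axis).
    Homogeneous blocks are stored as a single scalar; mixed blocks keep
    the full 8x8x8 array. Illustrative sketch only, not OctNet's layout."""
    H, W, D = voxels.shape
    assert H % 8 == 0 and W % 8 == 0 and D % 8 == 0
    blocks = {}
    for i in range(0, H, 8):
        for j in range(0, W, 8):
            for k in range(0, D, 8):
                block = voxels[i:i + 8, j:j + 8, k:k + 8]
                if block.min() == block.max():
                    # homogeneous: one value represents all 512 voxels
                    blocks[(i, j, k)] = block.flat[0]
                else:
                    # mixed: keep the block at full resolution
                    blocks[(i, j, k)] = block.copy()
    return blocks

# A mostly empty 16^3 grid: a single occupied voxel, so 7 of the
# 8 blocks collapse to one stored value each.
grid = np.zeros((16, 16, 16), dtype=np.uint8)
grid[3, 3, 3] = 1
blocks = to_shallow_octree_blocks(grid)
collapsed = sum(1 for v in blocks.values() if not isinstance(v, np.ndarray))
```

On sparse data, the vast majority of blocks collapse this way, which is exactly where the memory saving comes from.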

After converting the input tensor, they discuss how the most common operations within convolutional neural networks, convolution and pooling, can be performed efficiently on this data structure. For convolution, the discussion is relatively easy to follow, but very technical. The main source of speed-up is illustrated in Figure 2. The rectangle drawn in bold black refers to a single octree cell at depth $0$, i.e. covering $512$ voxels of the original resolution. Instead of convolving each of the $512$ voxels individually, convolution only needs to be performed around the edges of the octree cell; within the cell, the value of the convolution is computed only once for the whole cell.

Figure 2: Illustration of efficient convolution on the proposed octree data structure. A single octree cell at depth $0$ is illustrated by the bold black rectangle. Instead of performing convolution over the individual voxels within the octree-cell, we only need to perform regular convolution around the edges of the octree-cell as the convolution within the octree-cell can be computed directly.
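The saving can be sketched in one dimension, as a hedged analogy rather than OctNet's actual implementation: over a constant run of values, every output position whose receptive field lies fully inside the run sees the same input, so the convolution result can be computed once and reused; only positions near the run's boundary need individual evaluation.

```python
import numpy as np

def conv_constant_run(value, run_length, kernel):
    """Compare naive convolution over a constant run ("valid" positions
    only) with a single reused interior result. Every interior position
    sees the same constant input, so value * kernel.sum() suffices."""
    run = np.full(run_length, value, dtype=float)
    naive = np.convolve(run, kernel, mode="valid")  # one result per position
    interior = value * kernel.sum()                  # computed once for the run
    return naive, interior

kernel = np.array([0.25, 0.5, 0.25])
naive, interior = conv_constant_run(2.0, 8, kernel)
# every position inside the constant run equals the single reused value
```

In 3D, the same argument applies per octree cell: the interior contributes one multiply-accumulate per cell instead of one per voxel, while only the cell's boundary shell requires regular convolution.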

Similarly, pooling can be performed on the proposed data structure. The only caveat is that the depth of the individual octrees is reduced by one afterwards. Unfortunately, Riegler et al. do not give details on how this influences further processing or how exactly to avoid the problem. Unpooling, used for the point cloud labeling experiments, works analogously with the same caveat (assuming that the structure of the octree is already known from the input!).
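The depth reduction is easy to see on a dense block; the following is an illustrative sketch (not OctNet's octree-aware pooling): $2\times2\times2$ max pooling turns an $8^3$ block, the finest cells of a depth-$3$ shallow octree, into a $4^3$ block, i.e. a depth-$2$ structure.

```python
import numpy as np

def max_pool_2x2x2(block):
    """2x2x2 max pooling on a dense block: an 8^3 block (shallow octree
    of depth 3) becomes a 4^3 block (depth 2), which is the depth
    reduction noted above. Illustrative sketch only."""
    h, w, d = block.shape
    # group voxels into non-overlapping 2x2x2 windows and take the max
    return block.reshape(h // 2, 2, w // 2, 2, d // 2, 2).max(axis=(1, 3, 5))

block = np.arange(8 ** 3).reshape(8, 8, 8)
pooled = max_pool_2x2x2(block)
```

After one such pooling step the finest available resolution within each shallow octree is halved, which is presumably why further processing is affected in the way the review notes.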

Through experiments, Riegler et al. demonstrate impressive performance even at resolutions up to $256\times256\times256$, while most other approaches use resolutions on the order of $32\times32\times32$.

A Torch implementation of the proposed OctNet can be found on GitHub: griegler/octnet.