Tatarchenko et al. introduce Octree Generating Networks (OGNs), which can be understood as the deconvolutional extension of general Octree Networks (OctNets). As in OctNets, the main idea is to represent high-resolution 3D voxel grids using an octree structure, i.e. to subdivide the space into cubes of different sizes, organized in a tree. A 2D illustration can be found in Figure 1.
Figure 1: High-level overview of a single OGN block consisting of convolution, state prediction (“empty”, “filled”, “mixed”), loss computation and “mixed” block propagation.
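To make the octree idea concrete, here is a minimal sketch of my own (not code from the paper) that recursively splits a dense boolean occupancy grid into homogeneous cubes. The function name `build_octree`, the state constants and the toy grid are assumptions made for illustration; the grid is assumed to be cubic with a power-of-two edge length.

```python
import numpy as np

EMPTY, FILLED = 0, 1  # hypothetical state labels for homogeneous cubes


def build_octree(grid, origin=(0, 0, 0), size=None):
    """Recursively split a cubic boolean grid into homogeneous cubes.

    Returns a list of leaves (origin, size, state), where state is EMPTY
    or FILLED. Assumes grid.shape == (n, n, n) with n a power of two.
    """
    if size is None:
        size = grid.shape[0]
    x, y, z = origin
    block = grid[x:x + size, y:y + size, z:z + size]
    if not block.any():
        return [(origin, size, EMPTY)]
    if block.all():
        return [(origin, size, FILLED)]
    # The cube contains both occupied and free voxels: split into 8 children.
    half = size // 2
    leaves = []
    for dx in (0, half):
        for dy in (0, half):
            for dz in (0, half):
                leaves += build_octree(grid, (x + dx, y + dy, z + dz), half)
    return leaves


# Toy 8x8x8 grid: one fully occupied octant plus a single isolated voxel.
grid = np.zeros((8, 8, 8), dtype=bool)
grid[0:4, 0:4, 0:4] = True
grid[4, 4, 4] = True

leaves = build_octree(grid)
print(len(leaves), "octree leaves instead of", grid.size, "voxels")
```

Large homogeneous regions collapse into a single cube, and only the region around the isolated voxel is subdivided down to the finest resolution.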
In general, octrees are implemented using pointers, i.e. a block at level $l$ contains pointers to its child blocks at level $l + 1$ if the block is subdivided. However, Tatarchenko et al., similar to Riegler et al., choose to organize the octree in a hash table for efficient access. Then, as in Figure 1, they define three different operations: a convolution on OGNs, a loss on OGNs and a propagation scheme between levels. In general, an OGN starts by predicting the coarsest level using a convolution implemented on the octree. After predicting the state of each block, i.e. empty, filled or mixed, the generated octree is compared to the ground truth octree. The first two states correspond to an occupancy value (0 or 1), while the latter means that the block is refined in later layers. The loss is defined on the classification of the individual blocks. The blocks classified as “mixed” are further refined: a propagation layer passes the “mixed” blocks, as well as their neighbors, on to the next level, and the process is repeated at a more detailed level. This last step can be guided either by the ground truth octree (known shape) or by the predicted octree (unknown shape).
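The coarse-to-fine procedure can be sketched as follows. This is my own simplified illustration, not the authors' implementation: each level is stored as a Python dict keyed by integer cell coordinates (standing in for the hash table), the ground truth grid acts as an oracle in place of the network's state prediction, neighbors of “mixed” blocks are not propagated, and the per-level loss is assumed to be a plain cross-entropy over the three states. All function names are hypothetical.

```python
import numpy as np

EMPTY, FILLED, MIXED = 0, 1, 2  # hypothetical state labels


def cell_state(grid, level, key):
    """State of cell key = (x, y, z) at the given level of a dense grid."""
    size = grid.shape[0] >> level  # edge length of a cell at this level
    x, y, z = (c * size for c in key)
    block = grid[x:x + size, y:y + size, z:z + size]
    if not block.any():
        return EMPTY
    if block.all():
        return FILLED
    return MIXED


def propagate(cells):
    """Return the keys of the 8 children of every MIXED cell."""
    children = []
    for (x, y, z), state in cells.items():
        if state == MIXED:
            children += [(2 * x + dx, 2 * y + dy, 2 * z + dz)
                         for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
    return children


def generate(grid, max_level):
    """Coarse-to-fine generation guided by ground truth.

    Returns {level: {key: state}}; each inner dict plays the role of the
    hash table for one octree level.
    """
    octree, keys = {}, [(0, 0, 0)]
    for level in range(max_level + 1):
        octree[level] = {k: cell_state(grid, level, k) for k in keys}
        keys = propagate(octree[level])
        if not keys:  # nothing left to refine
            break
    return octree


def level_loss(pred_probs, gt_states):
    """Mean cross-entropy over cells (assumed loss, see the text above).

    pred_probs maps key -> array of 3 class probabilities.
    """
    return float(np.mean([-np.log(pred_probs[k][s])
                          for k, s in gt_states.items()]))


# Toy 8x8x8 ground truth: one filled octant and one isolated voxel.
grid = np.zeros((8, 8, 8), dtype=bool)
grid[0:4, 0:4, 0:4] = True
grid[4, 4, 4] = True

octree = generate(grid, max_level=3)
for level, cells in octree.items():
    print("level", level, "stores", len(cells), "cells")

# A uniform "prediction" at level 1, scored against the ground truth states.
uniform = {k: np.full(3, 1.0 / 3.0) for k in octree[1]}
loss = level_loss(uniform, octree[1])
```

In an actual OGN, the states would come from a learned classifier on the features of each block, and, at test time, `propagate` would be driven by the predicted rather than the ground truth states.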
Unfortunately, Tatarchenko et al. do not go into detail regarding the implementation; in my opinion, it might be very hard to reimplement their approach (although supplementary material is provided). However, the presented experimental results are promising, similar to those of Riegler et al. They show a significant decrease in runtime and memory consumption compared to dense voxel grids. On auto-encoding tasks, e.g. on the ShapeNet-cars dataset, they give qualitative and quantitative results.