Wu et al. present 3D ShapeNets, a model based on convolutional deep belief networks that tackles the problem of shape recognition and retrieval. They also introduce the ModelNet dataset, consisting of roughly 150k CAD models in 660 categories. For evaluation, however, they use a 10-category or a 40-category subset (these subsets are also used in several related publications [1,2]). Examples are shown in Figure 1.
Figure 1: The 3D shape models in the ModelNet dataset cover various categories such as window, aircraft, shelf, truck, fence, and coffee table.
The deep belief net architecture is summarized in Figure 2 and consists primarily of three convolutional layers. Background on deep belief nets is available in the literature; the energy of a convolutional layer is given by
$E(v,h) = -\sum_f \sum_j (h_j^f(W^f \star v)_j + c^fh_j^f) - \sum_l b_l v_l$
where $v_l$ denotes the visible units, $h_j^f$ the hidden units in feature channel $f$, and $W^f$ the convolution kernel for channel $f$; $b_l$ and $c^f$ are the biases of the visible and hidden units, respectively, and $\star$ denotes convolution. The convolutional layers also use a stride, see Figure 2.
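To make the energy concrete, the following is a minimal NumPy sketch that evaluates $E(v,h)$ for a single convolutional layer. It is an illustration under stated assumptions, not the authors' implementation: the function names `conv3d_valid` and `conv_rbm_energy` are made up for this example, volumes and kernels are assumed to be cubic, and $\star$ is implemented as a strided cross-correlation without kernel flipping.

```python
import numpy as np

def conv3d_valid(v, w, stride=1):
    """Strided 'valid' 3D cross-correlation of a cubic volume v with a cubic kernel w."""
    k = w.shape[0]
    n = (v.shape[0] - k) // stride + 1
    out = np.empty((n, n, n))
    for x in range(n):
        for y in range(n):
            for z in range(n):
                xs, ys, zs = x * stride, y * stride, z * stride
                patch = v[xs:xs + k, ys:ys + k, zs:zs + k]
                out[x, y, z] = np.sum(patch * w)
    return out

def conv_rbm_energy(v, h, W, c, b, stride=1):
    """E(v,h) = -sum_f sum_j (h_j^f (W^f * v)_j + c^f h_j^f) - sum_l b_l v_l.

    v: visible volume, h: hidden volumes (one per channel f),
    W: kernels (one per channel), c: hidden bias per channel,
    b: visible bias (scalar or array shaped like v).
    """
    energy = -np.sum(b * v)                      # visible bias term
    for f in range(W.shape[0]):
        response = conv3d_valid(v, W[f], stride)  # (W^f * v)_j for all positions j
        energy -= np.sum(h[f] * response) + c[f] * np.sum(h[f])
    return energy

# Toy usage: one channel, 4^3 input, 2^3 kernel, stride 2 (hypothetical sizes).
v = np.ones((4, 4, 4))
W = np.ones((1, 2, 2, 2))
h = np.ones((1, 2, 2, 2))
print(conv_rbm_energy(v, h, W, c=np.array([0.5]), b=0.1, stride=2))
```

The explicit loops are chosen for readability; they are far too slow for the $30 \times 30 \times 30$ inputs used in the paper.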
Figure 2: Illustration of the deep belief network. Three convolutional layers are followed by two fully connected layers. The number and size of the kernels are included. The model is trained on $24 \times 24 \times 24$ shapes, which are padded to $30 \times 30 \times 30$.
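The padding mentioned in the caption can be sketched in a few lines of NumPy. The symmetric margin of 3 voxels per side is an assumption derived from the two sizes ($30 - 24 = 6$), and `pad_voxels` is a hypothetical helper name.

```python
import numpy as np

def pad_voxels(grid, target=30):
    """Zero-pad a cubic occupancy grid symmetrically up to target^3 (assumed layout)."""
    margin = (target - grid.shape[0]) // 2
    return np.pad(grid, margin, mode="constant", constant_values=0)

shape = np.ones((24, 24, 24), dtype=np.uint8)  # dummy, fully occupied 24^3 shape
padded = pad_voxels(shape)
print(padded.shape)
```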
The training procedure is split into pre-training and fine-tuning. For pre-training, the three convolutional layers and the first fully connected layer are trained using contrastive divergence. For the top layer, fast persistent contrastive divergence is used. Pre-training proceeds layer by layer, i.e. as soon as the weights of a lower layer are trained, they are fixed and the next layer is trained. Fine-tuning is based on an algorithm similar to wake-sleep.
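The greedy layer-wise scheme can be illustrated with a small NumPy sketch. This is a simplification under several assumptions: it uses plain fully connected binary RBMs with one-step contrastive divergence (CD-1), it ignores the convolutional structure, and it does not cover fast persistent contrastive divergence or the fine-tuning stage. The names `cd1_update` and `pretrain_stack` as well as all hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, rng, lr=0.1):
    """One CD-1 step for a binary RBM on a batch v0; returns updated (W, b, c)."""
    ph0 = sigmoid(v0 @ W + c)                          # positive phase: p(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                        # reconstruction: p(v=1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)                          # negative phase: p(h=1 | v1)
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b = b + lr * (v0 - v1).mean(axis=0)
    c = c + lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

def pretrain_stack(data, layer_sizes, epochs=5, seed=0):
    """Greedy pre-training: train one RBM, fix its weights, then train the next one."""
    rng = np.random.default_rng(seed)
    layers, x = [], data
    for n_hidden in layer_sizes:
        W = 0.01 * rng.standard_normal((x.shape[1], n_hidden))
        b, c = np.zeros(x.shape[1]), np.zeros(n_hidden)
        for _ in range(epochs):
            W, b, c = cd1_update(x, W, b, c, rng)
        layers.append((W, b, c))
        x = sigmoid(x @ W + c)  # weights are now fixed; activations feed the next layer
    return layers

# Toy usage on random binary data (8 samples, 6 visible units).
data = (np.random.default_rng(1).random((8, 6)) < 0.5).astype(float)
layers = pretrain_stack(data, layer_sizes=[4, 3])
```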
They discuss two applications in detail: 3D reconstruction from 2.5D input (i.e. RGBD images) and next-best-view prediction. In the latter problem, given a single viewpoint, the task is to predict the best next view in order to reduce uncertainty regarding the category. For the former problem, a voxel cloud is constructed from the given RGBD image. All unobserved voxels (i.e. those lying behind the observed surface) are treated as missing information. To recover both the missing voxels and the category, Gibbs sampling is used to approximate the posterior $p(y|x_o)$, where $y$ is the category and the input $x$ is split into observed voxels $x_o$ and unobserved voxels $x_u$. The unobserved voxels $x_u$ are initialized with random values, and the input is propagated up to sample the category $y$. The sampled high-level signal is then propagated down to sample the unobserved voxels $x_u$, while the observed voxels $x_o$ stay fixed. This process is repeated up to $50$ times. Results are shown in Figure 3.
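The up-down sampling loop can be sketched as follows. This is a toy stand-in rather than the model from the paper: instead of the full deep belief net, it assumes a single RBM whose visible layer holds the flattened voxels and a one-hot label, and the function `complete_shape` together with the weight names `W`, `U`, `b`, `c`, `d` are hypothetical. The returned label frequencies are only a rough Monte Carlo estimate of $p(y|x_o)$, without any burn-in.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def complete_shape(x, observed, W, U, b, c, d, n_steps=50, seed=0):
    """Gibbs-sample unobserved voxels and the label in a toy joint RBM.

    x: flat binary voxel vector, observed: boolean mask of known voxels,
    W: voxel-hidden weights, U: label-hidden weights,
    b, c, d: voxel, hidden and label biases.
    """
    rng = np.random.default_rng(seed)
    x = x.astype(float).copy()
    unobserved = ~observed
    x[unobserved] = rng.integers(0, 2, size=unobserved.sum())  # random init of x_u
    n_classes = U.shape[0]
    y = np.zeros(n_classes)
    label_counts = np.zeros(n_classes)
    for _ in range(n_steps):
        # Upward pass: sample hidden units given voxels and current label.
        ph = sigmoid(x @ W + y @ U + c)
        h = (rng.random(ph.shape) < ph).astype(float)
        # Sample the category y from a softmax over the label units.
        logits = U @ h + d
        p_y = np.exp(logits - logits.max())
        p_y /= p_y.sum()
        k = rng.choice(n_classes, p=p_y)
        y = np.eye(n_classes)[k]
        label_counts[k] += 1
        # Downward pass: resample only x_u; observed voxels x_o stay clamped.
        pv = sigmoid(W @ h + b)
        sample = (rng.random(pv.shape) < pv).astype(float)
        x[unobserved] = sample[unobserved]
    return x, label_counts / n_steps

# Toy usage with random weights: 8 voxels, 5 hidden units, 3 categories.
rng = np.random.default_rng(42)
W, U = rng.standard_normal((8, 5)), rng.standard_normal((3, 5))
b, c, d = np.zeros(8), np.zeros(5), np.zeros(3)
x_in = rng.integers(0, 2, size=8)
mask = np.array([True] * 4 + [False] * 4)
completed, label_freq = complete_shape(x_in, mask, W, U, b, c, d)
```

Clamping $x_o$ in every downward pass is what turns plain Gibbs sampling into conditional sampling given the observed voxels.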
Figure 3: Examples of shape generation for some categories (left); examples of shape completion (right).