Socher et al. use an ensemble of random (not trained) recursive neural network for 3D object classification. Their work is often cited by recent approaches [1,2,3] to 3D shape recognition as one of the first approaches applying ideas and methods from convolutional neural networks to 3D object recognition. However, note that in contrast to the volumetric approaches, Socher et al. treat the depth of RGB-D images as additional channel. Therefore, the approach might be better described as 2.5D object classification.
The overall architecture is summarized in Figure 1 — $K$ convolutional filters are learned in an unsupervised manner using the approach presented in . To this end, patches are extracted from the RGB and the depth channels, mean subtracted, normalized by the standard deviation and finally whitened. Filters are then learned using $k$-means. On the pooled features obtained form these filters, Socher et al. apply several random recursive neural networks which apply the same weight matrix three times to obtain higher-level features. Finally a Softmax classifier is learned on the concatenated features.
Figure 1 (click to enlarge): High-level view of the proposed approach. Both the RGB and the depth channels are separately convolved by $K$ learned filters. The responses are subsequently pooled. These pooled features are merged and processed by recursive neural networks to learn higher-level features. The recursive neural networks consist of random weights that are not learned. Only the Softmax classifier on top of these features is learned by backpropagation.
Interestingly, Socher et al. minimize the training effort to a minimum and also resort to unsupervised training for low-level features. Still, they are able to present superior performance to other approaches which are mostly based on hand-crafted features combined with kernel machines or random forests. They also evaluate several of their design decisions; they show improved performance over using a single, trained recursive neural network instead of the ensemble of random recursive neural networks and using a trained convolutional neural network instead of the $k$-means learned filters.