Su et al., also motivated by the work of Wu et al. , propose a convolutional neural network architecture that fuses information from different views to tackle 3D shape recognition. Their approach is relatively simple. Based on the architecture from , they introduce a view pooling layer after the last convolutional layer. Different views, generated by perturbing the viewpoint relative to the 3D model, are fed to the network sequentially. The activations from these forward passes are collected at the view pooling layer, which takes the element-wise maximum across all views. The pooled activations are then passed on to the remaining layers (i.e. the last fully connected layers), and training continues from there. This way, they learn multi-view features that are later used for classification or retrieval. This approach is illustrated in Figure 1.
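To make the view pooling step concrete, here is a minimal, framework-free sketch. It assumes each view has already been reduced to a flat feature vector of the same length; the function name `view_pool` and the toy numbers are made up for illustration and are not taken from the paper.

```python
def view_pool(view_features):
    """Element-wise maximum across per-view feature vectors.

    `view_features` is a list with one flat feature vector per view.
    This is only an illustration of the pooling idea, not the authors' code.
    """
    if not view_features:
        raise ValueError("need at least one view")
    length = len(view_features[0])
    if any(len(features) != length for features in view_features):
        raise ValueError("all views must produce features of the same length")
    # zip(*...) groups the i-th activation of every view together.
    return [max(values) for values in zip(*view_features)]


# Three hypothetical views, each reduced to a 4-dimensional feature vector.
features_per_view = [
    [0.1, 0.9, 0.0, 0.3],
    [0.4, 0.2, 0.0, 0.8],
    [0.2, 0.5, 0.7, 0.1],
]
pooled = view_pool(features_per_view)
print(pooled)  # [0.4, 0.9, 0.7, 0.8]
```

Note that the pooled vector has the same size as a single view's features, regardless of how many views are used, so the fully connected layers that follow do not need to know the number of views.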
Experiments show that this approach outperforms the 3D ShapeNets proposed by Wu et al.  as well as several baselines based on hand-crafted geometric features. They also try to explain why this simple approach outperforms convolutional neural networks applied directly to the volume. In particular, they attribute the difference in performance to the low resolution used for volumetric convolutional neural networks (usually around $32^3$).
Interestingly, they also apply their approach to sketch recognition. Since a sketch does not come with multiple views, they instead feed jittered versions of each training sample to the network. These jittered versions then play the role of the individual views in the 3D application. While the performance increase is not as significant as in 3D shape recognition, they are still able to improve classification accuracy by 1% to 2%. However, it remains unclear to the reader whether this is due to the introduced view pooling layer or can be attributed to the increased size of the data set, similar to traditional data augmentation.
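As a rough illustration of how jittered copies can stand in for views, the following sketch generates a few translated versions of a tiny binary image. The choice of jitter (small integer translations), the helper names `shift` and `jittered_views`, and the 3x3 toy sketch are all assumptions made for this example; the summary above does not specify which kind of jitter the authors use.

```python
def shift(image, dx, dy, fill=0):
    """Translate a 2D grid by (dx, dy), padding exposed cells with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_x, new_y = x + dx, y + dy
            # Pixels moved outside the grid are dropped.
            if 0 <= new_x < width and 0 <= new_y < height:
                shifted[new_y][new_x] = image[y][x]
    return shifted


def jittered_views(image, offsets):
    """One jittered copy per (dx, dy) offset; each copy acts as one 'view'."""
    return [shift(image, dx, dy) for dx, dy in offsets]


# Toy 3x3 "sketch" with a single stroke pixel in the centre.
sketch = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
views = jittered_views(sketch, [(0, 0), (1, 0), (0, 1)])
for view in views:
    print(view)
```

Each element of `views` would then be passed through the network like a rendered view of a 3D model, with the per-view activations combined by the view pooling layer.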
What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: