Girdhar et al. present the so-called TL-embedding network, a combination of a 3D auto-encoder to reconstruct voxel grids and an AlexNet-like network to infer the voxel grid from 2D images. Their main motivation is to address two questions:
The proposed architecture is depicted in Figure 1 and is built around a 3D auto-encoder learning a $64$-dimensional representation. The auto-encoder consists of several convolutional and deconvolutional layers; the details can be found in the figure. Although not explicitly discussed, they predict occupancy grids and measure the error using a voxel-wise cross-entropy loss. For prediction from 2D, they use an AlexNet-like architecture (using the pre-trained weights), taking an image as input and predicting the $64$-dimensional representation learned by the auto-encoder.
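To make the loss concrete, here is a minimal sketch of a voxel-wise binary cross-entropy in plain Python. The function name `voxel_cross_entropy`, the nested-list grid layout, the averaging over voxels and the clamping constant `eps` are my own assumptions for illustration; the paper's implementation may differ in these details.

```python
import math

def voxel_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy over all voxels (illustrative sketch).

    pred:   nested lists (depth x height x width) of occupancy probabilities in [0, 1]
    target: nested lists of the same shape with ground-truth occupancy (0 or 1)
    """
    total, count = 0.0, 0
    for pred_plane, target_plane in zip(pred, target):
        for pred_row, target_row in zip(pred_plane, target_plane):
            for p, t in zip(pred_row, target_row):
                p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
                total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
                count += 1
    return total / count

# A 1 x 1 x 2 grid where the prediction is maximally uncertain.
uncertain_loss = voxel_cross_entropy([[[0.5, 0.5]]], [[[1, 0]]])
# The same grid with a perfect prediction.
perfect_loss = voxel_cross_entropy([[[1.0, 0.0]]], [[[1, 0]]])
```

An uncertain prediction of $0.5$ for every voxel yields a loss of $\ln 2$, while a perfect prediction drives the loss towards zero.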
Figure 1: Illustration of the network architecture. During training (T-Network), an auto-encoder and an AlexNet-like ConvNet are trained jointly. During testing (L-Network), the representations predicted by the AlexNet-like network are fed into the decoder of the auto-encoder to predict a voxel grid.
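The test-time path described in the caption can be sketched as a simple composition of two functions. Everything below is hypothetical: `predict_voxels`, `toy_image_net` and `toy_decoder` are stand-ins I made up so the snippet runs, and the threshold of $0.5$ for turning probabilities into binary occupancy is an assumption, not something stated in the paper.

```python
def predict_voxels(image, image_net, decoder, threshold=0.5):
    """Test-time path: image -> 64-d embedding -> occupancy probabilities -> binary grid."""
    embedding = image_net(image)        # stands in for the AlexNet-like ConvNet
    probabilities = decoder(embedding)  # stands in for the auto-encoder's decoder
    return [[[1 if p >= threshold else 0 for p in row] for row in plane]
            for plane in probabilities]

def toy_image_net(image):
    """Hypothetical stand-in: maps a flat list of pixel intensities to a 64-d vector."""
    mean_intensity = sum(image) / len(image)
    return [mean_intensity] * 64

def toy_decoder(embedding):
    """Hypothetical stand-in: maps a 64-d vector to a 2 x 2 x 2 grid of probabilities."""
    p = embedding[0]
    return [[[p if (i + j + k) % 2 == 0 else 1.0 - p for k in range(2)]
             for j in range(2)]
            for i in range(2)]

grid = predict_voxels([0.9, 0.7], toy_image_net, toy_decoder)
```

Note that the encoder of the auto-encoder does not appear here at all; at test time only the image network and the decoder are needed.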
For training, they generate data from CAD models by rendering them in front of random backgrounds taken from the internet. Training is done in three stages. First, the auto-encoder is trained separately. Then, the AlexNet-like network is trained to regress the learned $64$-dimensional representation (while keeping the auto-encoder fixed). Finally, both models are fine-tuned jointly.
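The three-stage schedule can be illustrated with a deliberately tiny toy problem. This is only a sketch under strong assumptions: every "network" is a single scalar weight, shapes and images are scalars, both losses are squared errors (the actual model uses a voxel-wise cross-entropy for reconstruction), and all names (`params`, `train`, `recon_loss`, `embed_loss`) are hypothetical. The point is solely which weights are updated and which losses are active in each stage.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Toy data: each "shape" is a scalar x and its "image" is the scalar 2 * x.
shapes = [0.5, 1.0, 1.5, 2.0]
images = [2.0 * x for x in shapes]

# One weight per component: encoder z = enc * x, decoder x_hat = dec * z,
# image network z_img = img * image.
params = {"enc": 0.5, "dec": 0.5, "img": 0.1}

def recon_loss(p):
    """Squared reconstruction error of the toy auto-encoder."""
    return mean((p["dec"] * p["enc"] * x - x) ** 2 for x in shapes)

def embed_loss(p):
    """Squared error between image embedding and auto-encoder embedding."""
    return mean((p["img"] * i - p["enc"] * x) ** 2 for x, i in zip(shapes, images))

def gradients(p, use_recon, use_embed):
    """Analytic gradients of the active losses with respect to all weights."""
    g = {"enc": 0.0, "dec": 0.0, "img": 0.0}
    n = len(shapes)
    for x, i in zip(shapes, images):
        if use_recon:
            r = p["dec"] * p["enc"] * x - x
            g["enc"] += 2 * r * p["dec"] * x / n
            g["dec"] += 2 * r * p["enc"] * x / n
        if use_embed:
            e = p["img"] * i - p["enc"] * x
            g["img"] += 2 * e * i / n
            g["enc"] -= 2 * e * x / n
    return g

def train(p, trainable, use_recon, use_embed, steps=300, lr=0.05):
    """Plain gradient descent that only updates the weights listed in `trainable`."""
    for _ in range(steps):
        g = gradients(p, use_recon, use_embed)
        for name in trainable:
            p[name] -= lr * g[name]

train(params, ["enc", "dec"], use_recon=True, use_embed=False)        # stage 1: auto-encoder only
train(params, ["img"], use_recon=False, use_embed=True)               # stage 2: image network, auto-encoder frozen
train(params, ["enc", "dec", "img"], use_recon=True, use_embed=True)  # stage 3: joint fine-tuning
```

In stage 2 the encoder weight still appears in the embedding loss, but since it is not listed as trainable it acts as a fixed regression target, mirroring the frozen auto-encoder.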
They provide several qualitative and quantitative experimental results. Reconstruction results using the auto-encoder are compared to a PCA baseline (see the paper). Regarding the first question, they conduct experiments on the smoothness and interpretability of the learned representations. Figure 2 shows examples where two representations are interpolated and Figure 3 shows examples of adapting individual dimensions. They also provide experiments on shape retrieval from images; see the paper for details.
Figure 2: Shape interpolation by interpolating between two fixed representations and using the decoder to predict a voxel grid.
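As a rough sketch of how such an interpolation can be set up, the snippet below generates evenly spaced points on the straight line between two $64$-dimensional representations. Linear interpolation and the function name `interpolate` are my assumptions; each resulting vector would then be passed through the trained decoder, which is not included here.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` embeddings evenly spaced on the line from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for s in range(steps):
        t = s / (steps - 1)  # t runs from 0.0 to 1.0 inclusive
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

# Two placeholder 64-d representations; real ones would come from the encoder.
z_a = [0.0] * 64
z_b = [1.0] * 64
path = interpolate(z_a, z_b, steps=5)
```

The first and last entries of `path` are the two original representations; the entries in between are the blends that get decoded into intermediate shapes.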
Figure 3: Illustration of adapting individual dimensions in the learned representation.
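Adapting an individual dimension can be sketched in the same spirit: keep a representation fixed, overwrite a single entry with a range of values and decode each variant. The helper `sweep_dimension`, the chosen dimension and the value range are hypothetical and only serve as illustration; the decoder itself is again left out.

```python
def sweep_dimension(z, dim, values):
    """Return copies of z in which entry `dim` is replaced by each value in `values`."""
    if not 0 <= dim < len(z):
        raise IndexError("dim out of range")
    variants = []
    for value in values:
        z_new = list(z)  # copy so the original representation stays untouched
        z_new[dim] = value
        variants.append(z_new)
    return variants

# Placeholder 64-d representation; a real one would come from the encoder.
z = [0.0] * 64
variants = sweep_dimension(z, dim=3, values=[-1.0, 0.0, 1.0])
```

Decoding the variants side by side shows what, if anything, the chosen dimension controls in the predicted voxel grid.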