# DAVIDSTUTZ

2nd May 2017

Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, Abhinav Gupta. Learning a Predictable and Generative Vector Representation for Objects. CoRR, 2016.

Girdhar et al. present the so-called TL-embedding network, a combination of a 3D auto-encoder that reconstructs voxel grids and an AlexNet-like [] network that infers the voxel grid from 2D images. Their main motivation is to address two questions:

• How to learn a generative representation in 3D?
• Can this representation be predicted from 2D?

The proposed architecture is depicted in Figure 1 and consists of a 3D auto-encoder learning a $64$-dimensional representation. The auto-encoder consists of several convolutional and deconvolutional layers; the details can be found in the figure. Although not explicitly discussed, they predict occupancy grids and measure the error using a voxel-wise cross-entropy loss. For prediction from 2D, they use an AlexNet-like architecture (initialized with pre-trained weights), taking an image as input and predicting the $64$-dimensional representation learned by the auto-encoder.
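To make the voxel-wise cross-entropy loss mentioned above concrete, here is a minimal sketch in plain Python. It assumes the decoder outputs one occupancy probability per voxel and that the grid has been flattened into a list. The function name `voxel_cross_entropy`, the averaging over voxels and the clamping constant `eps` are illustrative choices, not details taken from the paper.

```python
import math


def voxel_cross_entropy(predicted, target, eps=1e-7):
    """Mean binary cross-entropy over a flattened occupancy grid.

    predicted: occupancy probabilities in [0, 1], one per voxel
    target:    ground-truth occupancies (0 or 1), one per voxel
    eps:       clamping constant (an assumption) to avoid log(0)
    """
    if len(predicted) != len(target):
        raise ValueError("grids must have the same number of voxels")
    if not predicted:
        raise ValueError("grids must not be empty")

    total = 0.0
    for p, t in zip(predicted, target):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(predicted)


# A confident, correct prediction yields a lower loss than an uncertain one.
confident = voxel_cross_entropy([0.9, 0.1, 0.9], [1, 0, 1])
uncertain = voxel_cross_entropy([0.5, 0.5, 0.5], [1, 0, 1])
```

In practice, this would be computed by a deep learning framework over the full 3D output tensor; the loop above only spells out the per-voxel term.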

For training, they generate data from CAD models by rendering them in front of random backgrounds taken from the internet. Training is done in three stages. First, the auto-encoder is trained separately. Then, the AlexNet is trained to regress the learned $64$-dimensional representation (while keeping the auto-encoder fixed). Finally, both models are fine-tuned jointly.
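The three training stages can be summarized as a small schedule that records which components are updated in each stage. This is only a sketch: the component and loss names (`encoder`, `decoder`, `image_network`, `embedding_regression`) are hypothetical labels, and using both losses during joint fine-tuning is an assumption, as the summary above does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components whose weights are updated
    losses: tuple         # loss terms optimized in this stage (assumed names)


STAGES = (
    # Stage 1: the 3D auto-encoder is trained on its own.
    Stage("autoencoder",
          frozenset({"encoder", "decoder"}),
          ("voxel_cross_entropy",)),
    # Stage 2: the image network regresses the 64-dimensional code;
    # the auto-encoder is kept fixed.
    Stage("image_regression",
          frozenset({"image_network"}),
          ("embedding_regression",)),
    # Stage 3: everything is fine-tuned jointly (both losses: an assumption).
    Stage("joint_finetuning",
          frozenset({"encoder", "decoder", "image_network"}),
          ("voxel_cross_entropy", "embedding_regression")),
)


def is_frozen(component, stage):
    """Return True if `component` receives no updates in `stage`."""
    return component not in stage.trainable


frozen_in_stage_two = [c for c in ("encoder", "decoder", "image_network")
                       if is_frozen(c, STAGES[1])]
```

In an actual implementation, such a schedule would typically decide which parameters are passed to the optimizer in each stage.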

They provide several qualitative and quantitative experimental results. Reconstruction results using the auto-encoder are shown in Figure 2, compared to a PCA baseline. To address the first question, they conduct experiments on the smoothness and interpretability of the learned representations. Figure 3 shows examples where two representations are interpolated and Figure 5 shows examples of adapting individual dimensions. They also provide experiments regarding shape retrieval from images; see the paper for details.
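The interpolation experiment can be sketched as follows, assuming linear interpolation between two codes (the summary above does not specify the interpolation scheme). The helper `interpolate_codes` is hypothetical; in the actual setting, each code would have $64$ dimensions and each interpolated code would be passed through the decoder to obtain a voxel grid.

```python
def interpolate_codes(code_a, code_b, steps):
    """Return `steps` codes evenly spaced on the line from code_a to code_b."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same dimensionality")
    if steps < 2:
        raise ValueError("need at least two steps (the two end points)")

    codes = []
    for i in range(steps):
        alpha = i / (steps - 1)  # 0.0 at code_a, 1.0 at code_b
        codes.append([(1 - alpha) * a + alpha * b
                      for a, b in zip(code_a, code_b)])
    return codes


# Toy example with 2-dimensional codes instead of 64-dimensional ones.
path = interpolate_codes([0.0, 2.0], [1.0, 4.0], steps=3)
```

Adapting individual dimensions, as in Figure 5, would amount to changing a single entry of a code before decoding.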

• [] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. pp. 1097–1105 (2012).