Kihyuk Sohn, Honglak Lee, Xinchen Yan. Learning Structured Output Representation using Deep Conditional Generative Models. NIPS, 2015.

Sohn et al. propose two novel models based on the variational auto encoder framework: conditional variational auto encoders and Gaussian stochastic neural networks. The former adapts the objective of the variational auto encoder, i.e. maximizing the variational lower bound

$\log p_\theta(x) \geq -KL(q_\psi(z|x) \| p_\theta(z)) + E_{q_\psi(z|x)}[\log p_\theta(x|z)]$.

Here, $q_\psi(z|x)$ approximates the true posterior $p_\theta(z|x)$ and is usually implemented as a neural network predicting a Gaussian distribution depending on $x$. $p_\theta(z)$ is a prior on the latent code $z$, which is usually also Gaussian.
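To make the two terms of the bound concrete, here is a minimal sketch of how it could be estimated for a single input. It assumes a diagonal Gaussian $q_\psi(z|x)$ (given by a mean and a log-variance) and a standard normal prior, so the KL term has a closed form, while the expected log-likelihood is approximated by sampling. The function names and the `log_likelihood` callable are hypothetical stand-ins for real networks, and plain Python is used only to keep the example dependency-free.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def lower_bound_estimate(x, mu, log_var, log_likelihood, rng, num_samples=10):
    """Monte Carlo estimate of -KL(q(z|x) || p(z)) + E_q[log p(x|z)].

    `log_likelihood(x, z)` is a hypothetical decoder returning log p(x|z).
    """
    recon = sum(
        log_likelihood(x, sample_latent(mu, log_var, rng))
        for _ in range(num_samples)
    ) / num_samples
    return recon - kl_to_standard_normal(mu, log_var)
```

In practice, `mu` and `log_var` would be the outputs of the recognition network for `x`, and the negative of this estimate would be minimized with a gradient-based optimizer.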

In the conditional variational auto encoder, the goal is to generate an output $y$ for a given input $x$ by sampling from $p_\theta(y|x,z)$; the objective accordingly becomes a lower bound on the conditional log-likelihood:

$\log p_\theta(y|x) \geq -KL(q_\psi(z|x,y) \| p_\theta(z|x)) + E_{q_\psi(z|x,y)}[\log p_\theta(y|x,z)]$.

Note that the complete derivation, which follows the general derivation of the variational lower bound adapted to the conditional case, can be found in the provided supplementary material. Overall, the conditional variational auto encoder consists of the recognition network $q_\psi(z|x,y)$, the prior network $p_\theta(z|x)$ and the generation network $p_\theta(y|x,z)$.
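The interplay of the three networks can be sketched as follows. This is only an illustration under assumptions, not the authors' implementation: both the recognition and the prior network are assumed to output the mean and log-variance of a diagonal Gaussian, and `recognition`, `prior` and `log_likelihood` are hypothetical callables standing in for trained networks.

```python
import math
import random


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL(q || p) for two diagonal Gaussians."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0
    return 0.5 * total


def cvae_lower_bound(x, y, recognition, prior, log_likelihood, rng,
                     num_samples=10):
    """Estimate -KL(q(z|x,y) || p(z|x)) + E_q[log p(y|x,z)].

    recognition(x, y) -> (mu, log_var) of q(z|x,y)   (hypothetical)
    prior(x)          -> (mu, log_var) of p(z|x)     (hypothetical)
    log_likelihood(y, x, z) -> log p(y|x,z)          (hypothetical)
    """
    mu_q, log_var_q = recognition(x, y)
    mu_p, log_var_p = prior(x)
    recon = 0.0
    for _ in range(num_samples):
        # During training, z is sampled from the recognition network.
        z = [
            m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu_q, log_var_q)
        ]
        recon += log_likelihood(y, x, z)
    kl = kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p)
    return recon / num_samples - kl
```

Compared to the unconditional case, the only structural change is that the fixed prior is replaced by a network depending on $x$, so the KL term is computed between two predicted Gaussians.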

Sohn et al. further discuss a discrepancy between training and testing of the conditional variational auto encoder. Specifically, at testing time, $z$ is drawn from the prior $p_\theta(z|x)$, but at training time, the recognition network $q_\psi(z|x,y)$ is used. To make prediction during training and testing consistent, they set $q_\psi(z|x,y) = p_\theta(z|x)$. As a result, the KL term in the lower bound vanishes and the expectation is taken over the prior. The resulting objective is

$E_{p_\theta(z|x)}[\log p_\theta(y|x,z)]$.
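This objective can be estimated by sampling, as in the following minimal sketch. As before, it assumes a diagonal Gaussian prior network; `prior` and `log_likelihood` are hypothetical callables rather than part of the paper's code.

```python
import math
import random


def gsnn_objective(x, y, prior, log_likelihood, rng, num_samples=10):
    """Monte Carlo estimate of E_{p(z|x)}[log p(y|x,z)].

    prior(x) -> (mu, log_var) of p(z|x)       (hypothetical)
    log_likelihood(y, x, z) -> log p(y|x,z)   (hypothetical)
    """
    mu_p, log_var_p = prior(x)
    total = 0.0
    for _ in range(num_samples):
        # z is drawn from the prior, exactly as it would be at test time.
        z = [
            m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu_p, log_var_p)
        ]
        total += log_likelihood(y, x, z)
    return total / num_samples
```

Note that $y$ only enters through the log-likelihood; no recognition network is evaluated, which is what makes training and testing consistent.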

This model is introduced as the Gaussian stochastic neural network. Finally, the objectives of both models are combined in a weighted sum.
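A weighted sum of this kind could look as follows. The sketch assumes a convex combination with a single weight `alpha` in $[0, 1]$, i.e. $\alpha L_{CVAE} + (1 - \alpha) L_{GSNN}$; the function name and the range check are illustrative choices, not taken from the paper.

```python
def hybrid_objective(cvae_bound, gsnn_value, alpha):
    """Combine both objectives as alpha * CVAE + (1 - alpha) * GSNN.

    `alpha` is assumed to lie in [0, 1]; alpha = 1 recovers the
    conditional variational auto encoder bound, alpha = 0 the
    Gaussian stochastic neural network objective.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cvae_bound + (1.0 - alpha) * gsnn_value
```

The weight trades off a tight lower bound during training against consistency between training and testing.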
