Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, Max Welling. Semi-supervised Learning with Deep Generative Models. NIPS, 2014.

Kingma et al. propose a conditional variant of the variational auto-encoder for semi-supervised learning. A variatinal auto-encoder learns a latent representation $z$ corresponding to the data sample $x$. To this end, an inference network $q_psi(z|x)$ is used to approximate the posterior distribution $p(z|x)$. In the discussed case, $q_\psi(z|x) = \mathcal{N}(z|\mu_\psi(x), diag(\sigma^2_\psi(x)))$ where $\mu_\psi$ and $\sigma^2_\psi$ are modeled using a neural network. Following the variational principle, the lower bound on the marginal likelihood of a data point $x$ is given by

$\log p_\theta(x) \geq E_{q_\psi(z|x)}[log p_\theta(x|z)] – KL(q_psi(z|x)|p_\theta(z))$ (1)

where $p_\theta(z)$ is usually a unit Gaussian, such that the Kullback-Leibler divergence can be determined analytically. $p_\theta(x|z)$ is usually a multivariate Gaussian or a Bernoulli distribution whose parameters are computed by another neural network. This allows to easily evaluate $E_{q_\psi(z|x)}[log p_\theta(x|z)]$ as loss on top of the neural network.

In the semi-supervised case, i.e. if labels are only available for a small subset of the data, Kingma et al. Propose two different approaches. First, a variational auto-encoder as described above is learned on the labeled data and the inference network $q_\psi(z|x)$ is used as feature extractor. Or, second, a conditional variational auto-encoder is trained on all data. Then, there are two cases: either the label $y$ of a given samples $x$ is available or not. In both cases, the model is extended as follows. The generator network $p_\theta(x|y,z)$ now depends on both the latent space and the label. The inference model $q_\psi(z,y|x)$ decomposes into $q_\psi(z,y|x) = q_\psi(z|x) q_\psi(y|x)$ where $q_\psi(y|x)$ is specified as multinomial distribution:

$q_\psi(z|y,x) = \mathcal{N}(z|\mu_\psi(y,x), diag(\sigma^2_\psi(x)))$,

$q_\psi(y|x) = Cat(y|\pi_\psi(x))$.

Then, in the first case, the variational lower bound can be written as follows:

$\log p_\theta(x,y) \geq E_{q_\psi(z|x,y)}[\log p_\theta(x|y,z) - \log p_\theta(y) + \log p(z) - \log q_\psi(z|x,y)]$(2)

which follows from directly from Equation (1) by considering the new model. In the second case, the label is treated as missing latent variable:

$\log p_\theta(x) \geq E_{q_\psi(y,z|x)}[\log p_\theta(x|y,z) + \log p_\theta(y) + \log p(z) - \log q_\psi(y,z|x)] = (\ast)$.

Then, posterior inference can be performed. Let $\mathcal{L}$ be the negative of the right hand side of Equation (2), i.e. the lower bound if the label $y$ is known. It follows:

$(\ast) = \sum_y – q_\psi(y|x) \mathcal{L}(x, y) + \mathcal{H}(q_\psi(y|x))$(3)

where $\mathcal{H}$ denotes the entropy. This can be seen when noting that $q_\psi(y,z|x) = q_\psi(y|x)q_\psi(z|x)$ and substituting $\mathcal{L}(x,y)$. The variational bound for the whole dataset is the combination of Equations (2) and (3). However, $q_\psi(y|x)$ only takes part in Equation (3), thus, $q_\psi(y|x)$ is only learned on unlabeled data. This is undesirable as labeled data should also be used. Therefore, Kingma et al. Add a classification loss $E_{\widetilde{p}(x,y)}[-\log q_\psi(y|x)]$ where $\widetilde{p}(x,y)$ denotes the distribution of the labeled data. This classification loss can additionally be weighted – Kingma et al. Use $0.1 \cdot N$ as weight where $N$ is the total size of the dataset.

Training follows the discussion in [1], i.e. the training procedure for the general variational auto-encoder. Unfortunately, the gradients for the conditional variational auto-encoder are omitted. I would have expected a more detailed discussion regarding the newly introduced terms in Equations (2) and (3) for training.

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.