Diederik P. Kingma, Max Welling. Auto-Encoding Variational Bayes. CoRR, abs/1312.6114.

Kingma and Welling propose variational auto-encoders. The motivation is to perform efficient learning of directed models with continuous latent variables resulting in an intractable posterior distribution. Concretely, they consider cases where

  1. The marginal likelihood $p_\theta (x) = \int p_\theta(z)p_\theta(x|z) dz$ is intractable, so it cannot be evaluated or differentiated; the true posterior $p_\theta(z|x) = \frac{p_\theta(x|z)p_\theta(z)}{p_\theta(x)}$ is intractable, so EM-based algorithms are not applicable; and the integrals required by mean-field approaches are intractable as well.
  2. The dataset is too large for batch optimization, so learning has to be performed on mini-batches.

Similar in spirit to mean-field approaches, they introduce an approximate posterior (recognition model) $q_\phi (z|x)$ and consider the variational lower bound:

$\log p_\theta (x^{(i)}) \geq \mathcal{L}(\theta,\phi;x^{(i)}) = E_{q_\phi(z|x^{(i)})}[-\log q_\phi(z|x^{(i)}) + \log p_\theta(x^{(i)},z)]$

where $x^{(i)}$ denotes a data point from an i.i.d. dataset. Using Bayes' theorem, the lower bound can be rewritten as:

$\mathcal{L}(\theta,\phi;x^{(i)}) = - D_{\text{KL}}(q_\phi(z|x^{(i)})\|p_\theta(z)) + E_{q_\phi(z|x^{(i)})}[\log p_\theta (x^{(i)}|z)]$
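
Spelled out, the step from the first form of the bound to the second only needs the factorization $p_\theta(x,z) = p_\theta(x|z)p_\theta(z)$ and the definition of the KL divergence. The following short derivation uses only the symbols introduced above:

$\mathcal{L}(\theta,\phi;x^{(i)}) = E_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z) + \log p_\theta(z) - \log q_\phi(z|x^{(i)})]$

$= E_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)] - E_{q_\phi(z|x^{(i)})}\left[\log \frac{q_\phi(z|x^{(i)})}{p_\theta(z)}\right]$

$= E_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)] - D_{\text{KL}}(q_\phi(z|x^{(i)})\|p_\theta(z))$

The first term acts as an expected reconstruction log-likelihood, while the KL term keeps the approximate posterior close to the prior.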

In order to optimize the lower bound, we need to differentiate it with respect to both parameters $\theta$ and $\phi$. However, differentiation with respect to $\phi$ is problematic because the expectation is taken with respect to $q_\phi$ itself, and the usual Monte Carlo gradient estimator for this case exhibits high variance.

Kingma and Welling propose to apply a simple reparameterization trick. In particular, they express $z$ as

$z = g_\phi(\epsilon,x)$ with $\epsilon \sim p(\epsilon)$

where $g_\phi$ is a deterministic, differentiable function and $p(\epsilon)$ is the distribution of the auxiliary (noise) variable $\epsilon$, which does not depend on $\phi$. They give several examples of cases where this parameterization works; see the paper for details. The proposed Stochastic Gradient Variational Bayes (SGVB) estimator is then obtained by applying Monte Carlo estimation with the above reparameterization to the lower bound:
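
To get a feeling for why the reparameterization helps, the following minimal sketch compares two single-sample gradient estimators on a toy problem. The objective $f(z) = z^2$ with $z \sim \mathcal{N}(\mu, \sigma^2)$ and all function names are assumptions made for illustration and are not taken from the paper; the Gaussian case $z = \mu + \sigma\epsilon$ is used as $g_\phi$. The exact gradient of $E[z^2] = \mu^2 + \sigma^2$ with respect to $\mu$ is $2\mu$.

```python
import random
import statistics

def score_function_grad(mu, sigma, rng):
    """Single-sample score-function estimate of d/dmu E[z^2]."""
    z = rng.gauss(mu, sigma)
    # f(z) * d/dmu log N(z; mu, sigma^2)
    return z ** 2 * (z - mu) / sigma ** 2

def reparameterized_grad(mu, sigma, rng):
    """Single-sample reparameterized estimate of d/dmu E[z^2]."""
    eps = rng.gauss(0.0, 1.0)   # eps ~ p(eps) = N(0, 1)
    z = mu + sigma * eps        # z = g_phi(eps), deterministic given eps
    return 2.0 * z              # d/dmu (mu + sigma * eps)^2

def compare(mu=1.0, sigma=1.0, n=20000, seed=0):
    """Collect n estimates from each estimator and summarize them."""
    rng = random.Random(seed)
    sf = [score_function_grad(mu, sigma, rng) for _ in range(n)]
    rp = [reparameterized_grad(mu, sigma, rng) for _ in range(n)]
    return {
        "exact": 2.0 * mu,
        "score_mean": statistics.fmean(sf),
        "score_var": statistics.variance(sf),
        "reparam_mean": statistics.fmean(rp),
        "reparam_var": statistics.variance(rp),
    }

result = compare()
```

Both estimators are unbiased, so both means should land near the exact gradient; in this toy setting the reparameterized estimate shows a markedly lower sample variance, which is the property the SGVB estimator relies on.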

$\tilde{\mathcal{L}}^A(\theta,\phi;x^{(i)}) = \frac{1}{L} \sum_{l = 1}^L \left[\log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)} | x^{(i)})\right]$

with $z^{(i,l)} = g_\phi(\epsilon^{(i,l)},x^{(i)})$ and $\epsilon^{(i,l)} \sim p(\epsilon)$
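
As a rough illustration of this estimator, the sketch below evaluates it for a one-dimensional toy model. The model ($p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, $q(z|x) = \mathcal{N}(\mu,\sigma^2)$) and the helper names are assumptions chosen so that everything can be checked by hand; they are not the setup used in the paper. For this model the marginal is $p(x) = \mathcal{N}(0,2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$.

```python
import math
import random

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def sgvb_estimate(x, mu, sigma, num_samples, rng):
    """Monte Carlo estimate of the lower bound for the toy model
    p(z) = N(0, 1), p(x|z) = N(z, 1), q(z|x) = N(mu, sigma^2)."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # eps^(l) ~ p(eps)
        z = mu + sigma * eps        # z^(l) = g_phi(eps^(l), x)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        log_q = log_normal(z, mu, sigma ** 2)
        total += log_joint - log_q
    return total / num_samples

rng = random.Random(0)
x = 1.5
log_marginal = log_normal(x, 0.0, 2.0)  # exact log p(x) for this toy model
# q equal to the true posterior: every sample yields exactly log p(x).
tight = sgvb_estimate(x, x / 2.0, math.sqrt(0.5), 10, rng)
# q equal to the prior: the estimate falls below log p(x).
loose = sgvb_estimate(x, 0.0, 1.0, 5000, rng)
```

When $q$ matches the true posterior, the term inside the sum equals $\log p_\theta(x)$ for every sample, so the bound is tight regardless of $L$; a poorer $q$ leaves a gap equal to the KL divergence between $q$ and the true posterior.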

The KL divergence does not need to be estimated this way if it is available analytically; in that case only the expected reconstruction term is sampled. This estimator is then applied to mini-batches from a dataset. As an example, they introduce the variational auto-encoder, which is later used to generate data from the MNIST and Frey Faces datasets. Here, the approximate posterior $q_\phi (z|x)$ is modeled by a neural network that computes the mean and standard deviation of a multivariate Gaussian (assuming that $z$ is continuous). Similarly, $p_\theta (x|z)$ is modeled by a neural network that outputs the parameters of the distribution over $x$. Details can be found in the appendix of the paper.
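
For the Gaussian case the analytic KL term is short enough to write down directly. The sketch below assumes a standard normal prior $p_\theta(z) = \mathcal{N}(0, I)$ and a diagonal Gaussian $q_\phi(z|x)$ with mean `mu` and standard deviation `sigma`; the function name is hypothetical. Under these assumptions, $D_{\text{KL}} = \frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)) for equal-length sequences
    of means and (strictly positive) standard deviations."""
    return 0.5 * sum(
        m ** 2 + s ** 2 - 1.0 - math.log(s ** 2)
        for m, s in zip(mu, sigma)
    )

# q identical to the prior: the divergence vanishes.
kl_zero = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# Shifting the mean away from zero increases the divergence.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0, -2.0], [1.0, 1.0])
```

In a variational auto-encoder, `mu` and `sigma` would be the outputs of the encoder network for a given $x$, and the negative of this value would be added to the sampled reconstruction term.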

The optimization scheme is then summarized in Algorithm 1.

function auto_encoding_variational_bayes($M$, $L$)
    initialize $\theta$, $\phi$
    for $t = 1,\ldots$
        sample random mini-batch $x^{(1)}, \ldots, x^{(M)}$
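        sample noise $\epsilon = \{\epsilon^{(i,l)}\}$ with $\epsilon^{(i,l)} \sim p(\epsilon)$ for $i = 1,\ldots,M$ and $l = 1,\ldots,L$
        // $N$ is the size of the dataset; $\tilde{\mathcal{L}}^A$ is the SGVB estimator evaluated with the sampled noise:
        $g = \nabla_{\theta,\phi} \tilde{\mathcal{L}}^M(\theta, \phi; x^{(1)}, \ldots, x^{(M)}, \epsilon)$ with $\tilde{\mathcal{L}}^M(\theta, \phi; x^{(1)}, \ldots, x^{(M)}, \epsilon) = \frac{N}{M}\sum_{i = 1}^M \tilde{\mathcal{L}}^A(\theta,\phi;x^{(i)})$
        update $\theta$, $\phi$ according to SGD or similar algorithms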
    return $\theta$, $\phi$

Algorithm 1: Mini-batch version of the variational auto-encoder presented by Kingma and Welling.
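
The factor $\frac{N}{M}$ in Algorithm 1 rescales the mini-batch sum so that it estimates the bound over the full dataset. The following sketch checks this numerically; `toy_bound` is a hypothetical stand-in for the per-data-point bound and has nothing to do with an actual model, and the remaining names are likewise chosen for illustration.

```python
import random

def minibatch_bound(per_point_bound, data, batch_size, rng):
    """Scaled mini-batch estimate (N / M) * sum_i L(x_i) of the full-data bound."""
    batch = rng.sample(data, batch_size)  # M points drawn without replacement
    return len(data) / batch_size * sum(per_point_bound(x) for x in batch)

def toy_bound(x):
    # Hypothetical stand-in for L(theta, phi; x); any per-point function works.
    return -0.5 * x ** 2

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
full = sum(toy_bound(x) for x in data)

# Averaging many scaled mini-batch estimates approaches the full-data value.
estimates = [minibatch_bound(toy_bound, data, 100, rng) for _ in range(2000)]
average = sum(estimates) / len(estimates)

# With M = N the estimate coincides with the full sum (up to rounding).
exact_when_full = minibatch_bound(toy_bound, data, len(data), rng)
```

Because the scaled mini-batch value is an unbiased estimate of the full-data bound, its gradient can be plugged into SGD or a similar stochastic optimizer, as in the last step of the loop.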

Compared to the wake-sleep algorithm, they report faster convergence and better values of the estimated lower bound. More interestingly, however, they present examples of generated MNIST samples. For example, Figure 1 shows generated samples when using a two-dimensional latent variable $z$ and arranging the generated samples according to their position in latent space.


Figure 1: Visualizations of generated samples for a two-dimensional latent variable $z$.
