12th March 2017

Diederik P. Kingma, Max Welling. *Auto-Encoding Variational Bayes*. CoRR, abs/1312.6114.

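Kingma and Welling propose variational auto-encoders. The motivation is to perform efficient learning of directed models with continuous latent variables resulting in an intractable posterior distribution. Concretely, they consider cases where the marginal likelihood $p_\theta(x)$ and the true posterior $p_\theta(z|x)$ are intractable and the dataset is too large for batch optimization.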

Similar in spirit to mean-field approaches, they introduce an approximate posterior (recognition model) $q_\phi (z|x)$ in place of the intractable true posterior $p_\theta (z|x)$ and consider the variational lower bound:

$\log p_\theta (x^{(i)}) \geq \mathcal{L}(\theta,\phi;x^{(i)}) = E_{q_\phi(z|x^{(i)})}[-\log q_\phi(z|x^{(i)}) + \log p_\theta(x^{(i)},z)]$

where $x^{(i)}$ denotes a data point from an i.i.d. dataset. Factorizing the joint distribution as $p_\theta(x,z) = p_\theta(x|z)p_\theta(z)$, the lower bound can be rewritten as:

$\mathcal{L}(\theta,\phi;x^{(i)}) = - D_{\text{KL}}(q_\phi(z|x^{(i)}) \| p_\theta(z)) + E_{q_\phi(z|x^{(i)})}[\log p_\theta (x^{(i)}|z)]$
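Spelled out, with $x^{(i)}$ abbreviated as $x$ for readability, the rewriting proceeds in three steps (a worked version of the step above, not a new result):

$\mathcal{L}(\theta,\phi;x) = E_{q_\phi(z|x)}[\log p_\theta(x|z) + \log p_\theta(z) - \log q_\phi(z|x)]$

$= E_{q_\phi(z|x)}[\log p_\theta(x|z)] - E_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z)}\right]$

$= E_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p_\theta(z))$

The first term rewards latent codes that explain the data point well, while the KL term keeps the approximate posterior close to the prior.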

In order to optimize the lower bound, it has to be differentiated with respect to both parameters $\theta$ and $\phi$. However, differentiation with respect to $\phi$ is problematic because the expectation is taken over $q_\phi$ itself, and the usual Monte Carlo gradient estimator has high variance.
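The variance issue can be made concrete with a small numerical sketch. The toy setup is an assumption chosen for illustration and is not taken from the paper: $q_\phi(z) = \mathcal{N}(\mu, \sigma^2)$ without conditioning on $x$, the integrand $f(z) = z^2$, and the gradient of $E_q[f(z)] = \mu^2 + \sigma^2$ with respect to $\mu$, which is exactly $2\mu$. The sketch compares the usual score-function estimator with one that differentiates through a sample written as $z = \mu + \sigma\epsilon$, which anticipates the reparameterization described next. The helper name `gradient_samples` is hypothetical.

```python
import random
import statistics


def f(z):
    # Toy integrand; E_q[f(z)] = mu^2 + sigma^2 for q = N(mu, sigma^2).
    return z * z


def gradient_samples(mu, sigma, n, seed=0):
    """Return per-sample estimates of d/dmu E_q[f(z)] from two estimators."""
    rng = random.Random(seed)
    pathwise, score = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)   # eps ~ N(0, 1)
        z = mu + sigma * eps        # sample written as a function of eps
        # Differentiate through the sample: f'(z) * dz/dmu = 2 z * 1.
        pathwise.append(2.0 * z)
        # Score-function estimator: f(z) * d log q(z) / dmu.
        score.append(f(z) * (z - mu) / sigma ** 2)
    return pathwise, score


pathwise, score = gradient_samples(mu=1.0, sigma=1.0, n=20000)
print("exact gradient:", 2.0)
print("pathwise mean:", statistics.fmean(pathwise),
      "variance:", statistics.variance(pathwise))
print("score-function mean:", statistics.fmean(score),
      "variance:", statistics.variance(score))
```

For this particular toy case both estimators are unbiased, but the per-sample variance works out analytically to 30 for the score-function estimator versus 4 for the pathwise one. The gap depends on the integrand and the distribution, so this is an illustration rather than a general guarantee.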

Kingma and Welling propose to apply a simple reparameterization trick. In particular, they express $z$ as

$z = g_\phi(\epsilon,x)$ with $\epsilon \sim p(\epsilon)$

where $g_\phi$ is a deterministic function and $p(\epsilon)$ is the distribution of the auxiliary (noise) variable $\epsilon$. They give several examples of cases where this reparameterization works; see the paper for details. The proposed Stochastic Gradient Variational Bayes (SGVB) estimator is then obtained by applying Monte Carlo estimation with the above reparameterization to the lower bound:

$\tilde{\mathcal{L}}^A(\theta,\phi;x^{(i)}) = \frac{1}{L} \sum_{l = 1}^L \left[\log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)} | x^{(i)})\right]$

with $z^{(i,l)} = g_\phi(\epsilon^{(i,l)},x^{(i)})$ and $\epsilon^{(i,l)} \sim p(\epsilon)$.
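As a sanity check on the estimator, here is a minimal sketch for a one-dimensional toy model in which everything is known in closed form. The model is an assumption for illustration, not the paper's experiment: $p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, so that $p(x) = \mathcal{N}(0,2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. The approximate posterior is $q_\phi(z|x) = \mathcal{N}(\mu, \sigma^2)$ with $g_\phi(\epsilon, x) = \mu + \sigma\epsilon$. The function names `log_normal` and `sgvb_a` are hypothetical.

```python
import math
import random


def log_normal(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)


def sgvb_a(x, mu, sigma, L, seed=0):
    """(1/L) * sum_l [log p(x, z_l) - log q(z_l | x)], z_l = mu + sigma * eps_l."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(L):
        eps = rng.gauss(0.0, 1.0)                  # eps ~ p(eps) = N(0, 1)
        z = mu + sigma * eps                       # z = g_phi(eps, x)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        log_q = log_normal(z, mu, sigma ** 2)
        total += log_joint - log_q
    return total / L


x = 1.0
log_px = log_normal(x, 0.0, 2.0)                   # exact log marginal likelihood
print("log p(x):", log_px)
print("estimate with q = N(0, 1):", sgvb_a(x, 0.0, 1.0, L=20000))
print("estimate with q = true posterior:", sgvb_a(x, 0.5, math.sqrt(0.5), L=10))
```

When $q_\phi$ equals the true posterior, every term of the sum equals $\log p(x)$, so the estimate matches it up to floating-point error. For any other $q_\phi$, the estimate falls below $\log p(x)$ in expectation, as the lower bound requires.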

The KL divergence does not need to be estimated this way if it is available analytically; in that case, only the expected reconstruction term $E_{q_\phi(z|x^{(i)})}[\log p_\theta (x^{(i)}|z)]$ is estimated by sampling. The estimator is then applied to mini-batches drawn from the dataset.

As an example, they introduce the variational auto-encoder, which is later used to generate data from the MNIST dataset and the Frey Faces dataset. Here, the approximate posterior $q_\phi (z|x)$ is modeled by a neural network that computes the mean and standard deviation of a multivariate Gaussian (assuming that $z$ is continuous). Similarly, $p_\theta (x|z)$ is modeled by a neural network following the same idea. Details can be found in the appendix of the paper.
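For the Gaussian case, the KL term has a well-known closed form. The sketch below assumes a standard normal prior $p_\theta(z) = \mathcal{N}(0, I)$ and a diagonal Gaussian $q_\phi(z|x)$ whose per-dimension means and standard deviations would come from the encoder network; here they are simply passed in as lists. The function name `kl_to_standard_normal` is hypothetical.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for equal-length lists mu, sigma."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s)
        for m, s in zip(mu, sigma)
    )


# The divergence vanishes when q already equals the prior ...
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
# ... and grows as the mean moves away or the variance shrinks.
print(kl_to_standard_normal([1.0, 0.0], [1.0, 0.1]))
```

Using this expression removes the sampling noise from the KL part of the bound, so only the reconstruction term contributes Monte Carlo variance.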

The optimization scheme is then summarized in Algorithm 1.

Algorithm 1: Mini-batch version of the variational auto-encoder presented by Kingma and Welling.
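The general shape of such a mini-batch loop can be sketched on a toy problem. This is a simplified illustration under stated assumptions, not a reproduction of Algorithm 1: the generative model is fixed to $p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, so there is no $\theta$ to learn and only $\phi = (a, c)$ of $q_\phi(z|x) = \mathcal{N}(a x, e^{2c})$ is updated. Gradients are derived by hand and plain gradient ascent is used. The function name `train` and all hyperparameters are hypothetical.

```python
import math
import random


def train(steps=2000, batch_size=50, lr=0.01, seed=0):
    """Fit q(z|x) = N(a * x, exp(c)^2) by stochastic ascent on the lower bound."""
    rng = random.Random(seed)
    a, c = 0.0, 0.0                          # initial phi
    for _ in range(steps):
        grad_a = grad_c = 0.0
        for _ in range(batch_size):
            # Draw a data point from the toy model: z ~ N(0, 1), x | z ~ N(z, 1).
            x = rng.gauss(0.0, 1.0) + rng.gauss(0.0, 1.0)
            eps = rng.gauss(0.0, 1.0)        # one noise sample per data point
            s = math.exp(c)
            z = a * x + s * eps              # reparameterized sample
            resid = x - z                    # d log p(x|z) / dz
            # Gradient of  -KL(q || p(z)) + log p(x|z)  w.r.t. a and c,
            # with the KL term differentiated analytically.
            grad_a += -a * x * x + resid * x
            grad_c += 1.0 - s * s + resid * s * eps
        a += lr * grad_a / batch_size        # ascent step on the mini-batch average
        c += lr * grad_c / batch_size
    return a, math.exp(2.0 * c)


slope, variance = train()
print("learned posterior mean slope:", slope)    # true posterior mean is x / 2
print("learned posterior variance:", variance)   # true posterior variance is 1 / 2
```

In this toy model the true posterior is $\mathcal{N}(x/2, 1/2)$, so the learned slope and variance should both approach $0.5$. In the actual variational auto-encoder, both networks are trained jointly and gradients would typically come from automatic differentiation rather than hand derivation.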

Compared to the wake-sleep algorithm, they report better optimization of the estimated lower bound. More interestingly, however, they present examples of generated MNIST samples. For example, Figure 1 shows samples generated from a two-dimensional latent variable $z$, arranged according to their position in the latent space.

Figure 1: Visualizations of generated samples for a two-dimensional latent variable $z$.