Denoising Variational Auto-Encoders

A variational auto-encoder trained on corrupted (that is, noisy) examples is called a denoising variational auto-encoder. While the change is easy to implement, the underlying mathematical framework changes significantly. As the second article in my series on variational auto-encoders, this article discusses the mathematical background of denoising variational auto-encoders.

This is the second article in my series on variational auto-encoders. Before reading on, check out the first article introducing the mathematics of variational inference and variational auto-encoders. This article will introduce an extension to the vanilla variational auto-encoder []: the denoising variational auto-encoder [].

Prerequisites. As before, this article requires a basic understanding of linear algebra, probability theory — including conditional probability, Gaussian distributions, Bernoulli distributions and expectations — as well as (convolutional) neural networks. The first article on regular variational auto-encoders will also aid understanding.


Similar to denoising auto-encoders, Im et al. [] propose to train a variational auto-encoder on noisy samples and show that this variant effectively optimizes a different lower bound on the likelihood. Let us recall the evidence lower bound from the previous article,

$-\text{KL}(q(z|y) | p(z)) + \mathbb{E}_{q(z|y)}[\ln p(y|z)] = \mathbb{E}_{q(z|y)}\left[\ln\frac{p(y, z)}{q(z|y)}\right]$, (1)

where $\mathbb{E}$ denotes the expectation, $\text{KL}$ refers to the Kullback-Leibler divergence and $q(z|y)$, $p(y|z)$ and $p(z)$ refer to the recognition model (encoder), generative model (decoder) and prior, respectively. A denoising variational auto-encoder then introduces an additional conditional probability $q(y'|y)$ representing the noise model (i.e. $y'$ is the noisy example derived from $y$). Effectively, this results in an adapted recognition model

$\tilde{q}(z|y) = \int q(z|y') q(y'|y) dy'$.

While it cannot be shown that the resulting lower bound is strictly "better" than the original one above, we can argue that the new recognition model $\tilde{q}(z|y)$ covers a broader class of distributions and is, thus, more expressive.

Denoising Variational Auto-Encoder

As indicated above, the adapted approximate posterior $\tilde{q}(z|y)$ obtained after introducing the corruption model $q(y'|y)$ can be expressed as

$\tilde{q}(z|y) = \int q(z|y') q(y'|y) dy'$.

Following [], a useful interpretation is given in the following example:

Example. Consider a Gaussian corruption model $q(y'|y) = \mathcal{N}(y'; y, \sigma^2 I_R)$ and a posterior $q(z|y') = \mathcal{N}(z;\mu(y'), \text{diag}(\sigma^2(y')))$ whose parameters $\mu(y')$ and $\sigma^2(y')$ are modeled using a (possibly non-linear) neural network. Then

$\mathbb{E}_{q(y'|y)}[q(z|y')] = \int q(z|y')q(y'|y) dy' = \tilde{q}(z|y)$

is an infinite mixture of Gaussians. In this sense, the recognition model $\tilde{q}(z|y)$ is more powerful because it can cover a broader class of distributions compared to $q(z|y)$, e.g. multi-modal distributions. On the other hand, the corruption model $q(y'|y)$ may also remove all necessary information for the reconstruction.
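To make the mixture interpretation concrete, the following NumPy sketch draws samples from $\tilde{q}(z|y)$ by ancestral sampling: first $y' \sim q(y'|y)$, then $z \sim q(z|y')$. Note that `mu` and `sigma2` are purely illustrative stand-ins for the encoder networks, not trained models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for the (possibly non-linear) encoder networks
# mu(y') and sigma^2(y'); any trained encoder would take their place.
def mu(y_prime):
    return np.tanh(y_prime)

def sigma2(y_prime):
    # sigmoid shifted to stay strictly positive
    return 0.1 + 0.9 / (1.0 + np.exp(-y_prime))

def sample_tilde_q(y, noise_sigma=0.5, n_samples=10000):
    """Ancestral sampling from tilde_q(z|y):
    y' ~ q(y'|y) = N(y'; y, noise_sigma^2 I), then
    z  ~ q(z|y') = N(z; mu(y'), diag(sigma2(y')))."""
    y_prime = y + noise_sigma * rng.standard_normal((n_samples, y.shape[0]))
    z = mu(y_prime) + np.sqrt(sigma2(y_prime)) * rng.standard_normal(y_prime.shape)
    return z

y = np.array([2.0, -2.0])
z = sample_tilde_q(y)
# Every draw of y' selects a different Gaussian component
# N(mu(y'), diag(sigma2(y'))), so tilde_q(z|y) is an infinite mixture.
```

Each sampled $y'$ indexes a different mixture component, which is exactly why $\tilde{q}(z|y)$ can represent, e.g., multi-modal distributions even though each $q(z|y')$ is a single Gaussian.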

Similar to the variational auto-encoder, we intend to optimize a lower bound on the log-likelihood — the evidence lower bound in the corrupted case can be written as:

$\mathbb{E}_{\tilde{q}(z|y)}\left[\ln\frac{p(y,z)}{\tilde{q}(z|y)}\right] = \mathbb{E}_{\tilde{q}(z|y)}\left[\ln\frac{p(y,z)}{\mathbb{E}_{q(y'|y)}[q(z|y')]}\right]$ (2)

This is just Equation (1) applied to the new recognition model $\tilde{q}$. Note that in the numerator, we still have the uncorrupted $y$ — the clean training examples. The resulting objective (see below) is therefore computed between the uncorrupted example $y$ and the predicted reconstruction. However, we would like to pull the expectation in the denominator outside of the $\ln$ in order to obtain

$\mathbb{E}_{q(y'|y)}\mathbb{E}_{q(z|y')}\left[\ln\frac{p(y,z)}{q(z|y')}\right]$ (3)


which can easily be implemented by training a regular variational auto-encoder on corrupted examples. At this point, the question of interest can be posed as follows: Can we optimize Equation (3) instead of Equation (2) (which is the true evidence lower bound of the problem)? We can find an answer in the following lemma:

Lemma. Given the approximate, corrupted posterior

$\tilde{q}(z|y) = \int q(z|y') q(y'|y) dy'$

and the corresponding generative model $p(y,z) = p(y|z)p(z)$, it holds:

$\ln p(y) \geq \mathbb{E}_{\tilde{q}(z|y)}\left[\ln\frac{p(y,z)}{q(z|y')}\right] \geq \mathbb{E}_{\tilde{q}(z|y)}\left[\ln\frac{p(y,z)}{\tilde{q}(z|y)}\right]$.

Proof. For the left inequality, we first rewrite the expectation:

$\mathbb{E}_{\tilde{q}(z|y)}\left[\ln\frac{p(y,z)}{q(z|y')}\right] = \int \tilde{q}(z|y) \ln\frac{p(y,z)}{q(z|y')} dz$

$= \int \int q(z|y')q(y'|y) \ln\frac{p(y,z)}{q(z|y')} dy'dz$

$= \mathbb{E}_{q(y'|y)}\mathbb{E}_{q(z|y')} \left[\ln\frac{p(y,z)}{q(z|y')}\right].$

Now, we can apply Jensen's inequality (e.g., see the corresponding Wikipedia entry) to move the $\ln$ outside the expectation:

$\mathbb{E}_{q(y'|y)}\mathbb{E}_{q(z|y')} \left[\ln\frac{p(y,z)}{q(z|y')}\right] \leq \ln \mathbb{E}_{q(y'|y)}\mathbb{E}_{q(z|y')} \left[\frac{p(y,z)}{q(z|y')}\right]$.

Simplifying the right-hand side yields

$\ln \mathbb{E}_{q(y'|y)}\mathbb{E}_{q(z|y')} \left[\frac{p(y,z)}{q(z|y')}\right]$

$= \ln \int \int q(z|y')q(y'|y) \frac{p(y,z)}{q(z|y')} dy'dz$

$= \ln \int \int q(y'|y) p(y, z) dy' dz = \ln p(y)$

which concludes the proof for the left inequality. For the right inequality, we write

$\mathbb{E}_{\tilde{q}(z|y)}\left[\ln\frac{p(y,z)}{q(z|y')}\right] = \mathbb{E}_{\tilde{q}(z|y)}[\ln p(y,z)] - \mathbb{E}_{\tilde{q}(z|y)}[\ln q(z|y')]$.

We can now use the fact that

$\mathbb{E}_{\tilde{q}(z|y)}[\ln q(z|y')] \leq \mathbb{E}_{\tilde{q}(z|y)}[\ln \tilde{q}(z|y)]$

which also follows from Jensen's inequality by considering

$\mathbb{E}_{\tilde{q}(z|y)}[\ln q(z|y')] - \mathbb{E}_{\tilde{q}(z|y)}[\ln \tilde{q}(z|y)]$

$= \mathbb{E}_{\tilde{q}(z|y)}\left[\ln \frac{q(z|y')}{\tilde{q}(z|y)}\right]$

$\leq \ln \mathbb{E}_{\tilde{q}(z|y)}\left[\frac{q(z|y')}{\tilde{q}(z|y)}\right]$

$= \ln \int \tilde{q}(z|y) \frac{q(z|y')}{\tilde{q}(z|y)} dz = \ln 1 = 0$.

Then, we obtain

$\mathbb{E}_{\tilde{q}(z|y)}\left[\ln\frac{p(y,z)}{q(z|y')}\right]$

$= \mathbb{E}_{\tilde{q}(z|y)}[\ln p(y,z)] - \mathbb{E}_{\tilde{q}(z|y)}[\ln q(z|y')]$

$\geq \mathbb{E}_{\tilde{q}(z|y)}[\ln p(y,z)] - \mathbb{E}_{\tilde{q}(z|y)}[\ln \tilde{q}(z|y)]$

$= \mathbb{E}_{\tilde{q}(z|y)}\left[\ln\frac{p(y,z)}{\tilde{q}(z|y)}\right]$

which is the second inequality.

As a result, we can indeed use Equation (3) as a surrogate objective: by the lemma, it still lower-bounds the true evidence lower bound of Equation (2). In practice, we use

$\mathbb{E}_{\tilde{q}(z|y)}\left[\ln\frac{p(y,z)}{q(z|y')}\right] = \mathbb{E}_{q(y'|y)}\mathbb{E}_{q(z|y')} \left[\ln\frac{p(y,z)}{q(z|y')}\right]$

$= \mathbb{E}_{q(y'|y)}\left[\mathbb{E}_{q(z|y')}[\ln p(y|z)] - \text{KL}(q(z|y')|p(z))\right]$.

Similar to the regular variational auto-encoder, the objective is split into the reconstruction error and the Kullback-Leibler divergence. However, the recognition model now learns to approximate the prior $p(z)$ given corrupted examples; this also implies that the reconstruction error can now be seen as a denoising criterion.

The overall minimization objective can then be written as

$\mathcal{L}_{\text{DVAE}}(w) = \frac{1}{L} \sum_{l = 1}^L \left[\text{KL}(q(z|y'_{l,m})|p(z)) - \frac{1}{L'}\sum_{l' = 1}^{L'} \ln p(y_m|z_{l,l',m})\right]$

where $y'_{l,m} \sim q(y'|y_m)$ and $z_{l,l',m} = g(y'_{l,m}, \epsilon_{l,l',m})$ with $\epsilon_{l,l',m} \sim \mathcal{N}(\epsilon;0,1)$. In practice, $L = L' = 1$ is used. Sampling $y'_{l,m} \sim q(y'|y_m)$ is then the only difference to the vanilla variational auto-encoder.
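A minimal NumPy sketch of this objective with $L = L' = 1$, assuming a Bernoulli decoder (as introduced in the first article) and the closed-form Kullback-Leibler divergence between a diagonal Gaussian and a standard normal prior. The encoder and decoder outputs below are dummy placeholders, not trained networks:

```python
import numpy as np

rng = np.random.default_rng(1)

def dvae_loss(y, mu_z, logvar_z, y_hat):
    """Per-example DVAE objective with L = L' = 1:
    KL(N(mu_z, diag(exp(logvar_z))) | N(0, I)) minus the Bernoulli
    log-likelihood of the *clean* y under the reconstruction y_hat."""
    kl = 0.5 * np.sum(np.exp(logvar_z) + mu_z**2 - 1.0 - logvar_z)
    recon = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return kl + recon

# The only change to the vanilla VAE: corrupt the input before encoding ...
y = rng.uniform(size=10)                      # clean example y_m in (0, 1)
y_noisy = y + 0.1 * rng.standard_normal(10)   # y' ~ q(y'|y_m)
# ... encode y_noisy (dummy encoder outputs here), decode to y_hat,
# but compute the reconstruction error against the clean y.
mu_z, logvar_z = np.zeros(2), np.zeros(2)     # placeholder encoder output
y_hat = np.clip(y_noisy, 1e-6, 1 - 1e-6)      # placeholder reconstruction
loss = dvae_loss(y, mu_z, logvar_z, y_hat)
```

With the placeholder encoder output $\mu = 0$, $\log\sigma^2 = 0$, the Kullback-Leibler term vanishes and only the denoising reconstruction error remains; in an actual training loop, both terms would be back-propagated through the encoder and decoder.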


The next article will cover variational auto-encoders with discrete latent variables. Afterwards, we will discuss a Torch implementation of the introduced concepts.

  • [] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. CoRR, abs/1601.00670, 2016.
  • [] C. Doersch. Tutorial on variational autoencoders. CoRR, abs/1606.05908, 2016.
  • [] D. P. Kingma and M. Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
  • [] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.
  • [] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611-622, 1999.
  • [] D. J. Im, S. Ahn, R. Memisevic, and Y. Bengio. Denoising criterion for variational auto-encoding framework. In AAAI Conference on Artificial Intelligence, pages 2059-2065, 2017.
  • [] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. CoRR, abs/1611.01144, 2016.
  • [] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712, 2016.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.