Lars M. Mescheder, Sebastian Nowozin, Andreas Geiger. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. ICML, 2017.

Mescheder et al. unify variational auto-encoders (VAEs) [1] and generative adversarial networks (GANs) [2], which allows the variational encoder to be replaced by a black-box inference model, as illustrated in Figure 1.

To recapitulate, Mescheder et al. first consider an ordinary VAE consisting of a generative model $p_\theta(x | z)$ (parameterized by $\theta$) of the visible variables $x$ given the latent code $z$, a prior on the latent codes $p(z)$ and an approximate inference model $q_\phi(z|x)$ (parameterized by $\phi$) over the latent variables given the visible ones. In [1], it is shown that VAEs are trained by solving the following optimization problem:

$\max_\theta \max_\phi E_{p_{\mathcal{D}}(x)}\left[-\text{KL}(q_\phi(z | x), p(z)) + E_{q_\phi(z|x)} \log p_\theta(x|z)\right]$ (1)

where $\text{KL}$ refers to the Kullback-Leibler divergence. In the ordinary VAE framework, the inference model $q_\phi(z|x)$ is usually taken to be a Gaussian distribution with diagonal covariance whose mean and variances are parameterized by a neural network. This choice is very restrictive: for example, it cannot represent multi-modal or correlated posteriors over the latent code $z$.
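To make the Gaussian case concrete, here is a minimal NumPy sketch of the two ingredients an ordinary VAE needs from its encoder: the closed-form KL term and a reparameterized sample. It assumes a standard normal prior $p(z) = N(0, I)$, which the summary above does not fix, and the function names and the example values for mean and log-variance are made up for illustration (in a real VAE they would be the outputs of the encoder network).

```python
import numpy as np

def gaussian_kl_to_standard_normal(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mean**2 - 1.0 - log_var, axis=-1)

def reparameterized_sample(mean, log_var, rng):
    """Draw z = mean + std * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
# Hypothetical encoder outputs for a single input x and a 2-D latent code.
mean = np.array([[0.5, -1.0]])
log_var = np.array([[0.0, np.log(0.25)]])

kl = gaussian_kl_to_standard_normal(mean, log_var)   # one KL value per input
z = reparameterized_sample(mean, log_var, rng)       # one latent sample per input
```

Both functions rely on the density of $q_\phi(z|x)$ being known in closed form, which is exactly what is lost once the encoder becomes a black box.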

To derive the proposed model, the optimization problem in Equation (1) is rewritten to

$\max_\theta \max_\phi E_{p_{\mathcal{D}}(x)} E_{q_\phi(z|x)} (\log p(z) - \log q_\phi(z | x) + \log p_\theta(x | z))$ (2)
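
The step from Equation (1) to Equation (2) only uses the definition of the Kullback-Leibler divergence. Spelled out (a short derivation added for convenience, not quoted from the paper):

$-\text{KL}(q_\phi(z|x), p(z)) = -E_{q_\phi(z|x)} (\log q_\phi(z|x) - \log p(z)) = E_{q_\phi(z|x)} (\log p(z) - \log q_\phi(z|x))$

Since the reconstruction term is also an expectation over $q_\phi(z|x)$, both terms can be merged under a single expectation, which gives Equation (2).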

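Now, if $q_\phi(z|x)$ is given by a black-box model, its density can no longer be evaluated, so the term $\log q_\phi(z|x)$ in Equation (2) is not available and the objective can no longer be optimized directly using gradient descent. Therefore, Mescheder et al. introduce a discriminative network to represent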

$\log p(z) - \log q_\phi(z|x)$ (3)

The discriminator network is supposed to recover this term up to sign: at its optimum, it attains $T^\ast(x, z) = \log q_\phi(z|x) - \log p(z)$, the negative of Equation (3). To this end, they propose the following objective for the discriminator network $T(x, z)$:

$\max_T E_{p_{\mathcal{D}}(x)} E_{q_\phi(z|x)} \log \sigma(T(x, z)) + E_{p_{\mathcal{D}}(x)} E_{p(z)} \log(1 - \sigma(T(x,z)))$.

where $\sigma(t)$ denotes the sigmoid function. The idea is to let $T$ distinguish whether input-code pairs $(x, z)$ were sampled from $p_{\mathcal{D}}(x)p(z)$ or from $p_{\mathcal{D}}(x)q_\phi(z|x)$. Letting $T^\ast$ denote the optimal discriminator, Equation (2) can be rewritten as

$\max_{\theta, \phi} E_{p_{\mathcal{D}}(x)}E_{q_\phi(z|x)} (- T^\ast(x, z) + \log p_\theta(x | z))$
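
As a quick sanity check of my own (not from the paper), the optimal discriminator can be found pointwise: for a fixed pair $(x, z)$, the integrand of the discriminator objective is $q \log \sigma(t) + p \log(1 - \sigma(t))$ with $q = q_\phi(z|x)$ and $p = p(z)$, and it is maximized at $t = \log q - \log p$. The sketch below verifies this by brute force on a grid of $t$ values. The standard normal prior and the Gaussian $q_\phi(z|x)$ with mean 0.7 and standard deviation 0.5 are assumptions chosen purely for illustration.

```python
import numpy as np

def normal_pdf(z, mean, std):
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def pointwise_objective(t, q, p):
    """q * log(sigmoid(t)) + p * log(1 - sigmoid(t)), computed stably."""
    log_sigmoid = -np.logaddexp(0.0, -t)
    log_one_minus_sigmoid = -np.logaddexp(0.0, t)
    return q * log_sigmoid + p * log_one_minus_sigmoid

z = np.linspace(-1.0, 2.0, 7)
q = normal_pdf(z, mean=0.7, std=0.5)   # hypothetical q_phi(z | x) for one fixed x
p = normal_pdf(z, mean=0.0, std=1.0)   # assumed prior p(z)

# Brute-force maximizer over t for every z (grid spacing 0.001).
t_grid = np.linspace(-10.0, 10.0, 20001)
values = pointwise_objective(t_grid[None, :], q[:, None], p[:, None])
t_best = t_grid[np.argmax(values, axis=1)]

# Closed-form optimum of the discriminator.
t_star = np.log(q) - np.log(p)
max_gap = np.max(np.abs(t_best - t_star))
```

Up to the grid resolution, `t_best` and `t_star` coincide, which is why $-T^\ast$ can stand in for $\log p(z) - \log q_\phi(z|x)$ in the objective above.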

To optimize this objective, the gradients with respect to $\theta$ and $\phi$ need to be computed. This is problematic as $T^\ast$ is defined implicitly as the solution of an optimization problem that itself depends on $\phi$. However, Mescheder et al. show that taking the gradient of $T^\ast$ with respect to $\phi$ is not necessary as

$E_{q_\phi(z|x)} ( \nabla_\phi T^\ast(x,z)) = 0$

(see the paper for proof). Then, the objective can be rewritten using the reparameterization trick [1]:

$\max_{\theta,\phi} E_{p_{\mathcal{D}}(x)}E_\epsilon ( -T^\ast(x, z_\phi(x, \epsilon)) + \log p_\theta(x | z_\phi(x, \epsilon)))$
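
The zero-expectation identity for $\nabla_\phi T^\ast$ used above can be illustrated numerically. Since the optimal discriminator is $T^\ast(x, z) = \log q_\phi(z|x) - \log p(z)$ and the prior does not depend on $\phi$, the gradient of $T^\ast$ with respect to $\phi$ at fixed $z$ is the score function $\nabla_\phi \log q_\phi(z|x)$, whose expectation under $q_\phi(z|x)$ vanishes. The sketch below checks this by Monte Carlo for a 1-D Gaussian $q_\phi$ with $\phi = (\text{mean}, \text{std})$. This tractable choice is an assumption made only so that the score can be written down; the paper's proof does not need it.

```python
import numpy as np

def score_wrt_mean_and_std(z, mean, std):
    """Gradient of log N(z; mean, std^2) with respect to mean and std, at fixed z."""
    d_mean = (z - mean) / std**2
    d_std = ((z - mean) ** 2 - std**2) / std**3
    return d_mean, d_std

rng = np.random.default_rng(0)
mean, std = 0.7, 0.5                       # hypothetical values of phi
z = mean + std * rng.standard_normal(200_000)   # samples from q_phi

d_mean, d_std = score_wrt_mean_and_std(z, mean, std)
avg_d_mean = d_mean.mean()   # Monte Carlo estimate of E_q[dT*/dmean]
avg_d_std = d_std.mean()     # Monte Carlo estimate of E_q[dT*/dstd]
```

Both averages are zero up to Monte Carlo noise, so only the dependence of the objective on $\phi$ through the sample $z_\phi(x, \epsilon)$ has to be differentiated.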

In practice, the discriminator objective and the reparameterized objective above are optimized jointly as in Algorithm 1.

The paper also discusses the case where $T$ fails to come sufficiently close to the optimal solution and thus hinders optimization; I refer to the paper for details.


Algorithm 1: Adversarial Variational Bayes in pseudo-code.
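
As a rough illustration of the joint optimization, here is a toy NumPy sketch with hand-derived gradients. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: scalar data, a standard normal prior, a linear decoder $p_\theta(x|z) = N(x; \theta z, 1)$, an encoder $z_\phi(x, \epsilon) = a x + b \epsilon$ with $\phi = (a, b)$, and a discriminator that is linear in a few hand-picked quadratic features with weights `w`. In the actual method, encoder, decoder and discriminator are neural networks.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

def features(x, z):
    """Hypothetical discriminator features; T(x, z) = features(x, z) @ w."""
    return np.stack([np.ones_like(z), z, z**2, x * z, x**2], axis=-1)

def avb_step(params, x, rng, lr=0.01):
    """One joint gradient-ascent step on theta, phi = (a, b) and discriminator weights w."""
    theta, a, b, w = params["theta"], params["a"], params["b"], params["w"]
    eps = rng.standard_normal(x.shape)
    z_prior = rng.standard_normal(x.shape)   # z ~ p(z) = N(0, 1), assumed prior
    z_q = a * x + b * eps                    # z_phi(x, eps), reparameterized sample

    # Objective for theta and phi: -T(x, z_q) + log p_theta(x | z_q).
    # T is differentiated only through z_q, not through its own weights.
    resid = x - theta * z_q
    dT_dz = w[1] + 2.0 * w[2] * z_q + w[3] * x
    dobj_dz = -dT_dz + resid * theta
    g_theta = np.mean(resid * z_q)
    g_a = np.mean(dobj_dz * x)
    g_b = np.mean(dobj_dz * eps)

    # Objective for T: log sigmoid(T(x, z_q)) + log(1 - sigmoid(T(x, z_prior))).
    f_q, f_p = features(x, z_q), features(x, z_prior)
    g_w = np.mean((1.0 - sigmoid(f_q @ w))[:, None] * f_q
                  - sigmoid(f_p @ w)[:, None] * f_p, axis=0)

    return {"theta": theta + lr * g_theta, "a": a + lr * g_a,
            "b": b + lr * g_b, "w": w + lr * g_w}

def train(num_steps=200, batch_size=64, seed=0):
    """Run the joint updates on toy data x = 2 z + noise with z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    params = {"theta": 1.0, "a": 0.0, "b": 1.0, "w": np.zeros(5)}
    for _ in range(num_steps):
        x = 2.0 * rng.standard_normal(batch_size) + rng.standard_normal(batch_size)
        params = avb_step(params, x, rng)
    return params
```

Calling `train()` returns the updated parameters after a few hundred joint steps. The quadratic features were chosen because, in this linear-Gaussian toy setting, the optimal discriminator $\log q_\phi(z|x) - \log p(z)$ is itself quadratic in $(x, z)$; I make no claim about convergence of this sketch.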

  • [1] Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [2] Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
