20th March 2018

Lars M. Mescheder, Sebastian Nowozin, Andreas Geiger. *Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks*. ICML, 2017.


Mescheder et al. unify variational auto-encoders (VAEs) [1] and generative adversarial networks (GANs) [2], which allows the variational encoder to be replaced by a black-box inference model, as illustrated in Figure 1.

To recapitulate, Mescheder et al. first consider an ordinary VAE consisting of a generative model $p_\theta(x | z)$ (parameterized by $\theta$) of the visible variables $x$ given the latent code $z$, a prior on the latent codes $p(z)$, and an approximate inference model $q_\phi(z|x)$ (parameterized by $\phi$) over the latent variables given the visible ones. As derived in [1], VAEs are trained by solving the following maximization problem:

$\max_\theta \max_\phi E_{p_{\mathcal{D}}(x)}\left[-\text{KL}(q_\phi(z | x), p(z)) + E_{q_\phi(z|x)} \log p_\theta(x|z)\right].$ (1)

where $p_{\mathcal{D}}(x)$ is the data distribution and $\text{KL}$ refers to the Kullback-Leibler divergence. In the ordinary VAE framework, the inference model $q_\phi(z|x)$ is usually taken to be a Gaussian distribution with diagonal covariance, where mean and variance are parameterized by a neural network. This choice is very restrictive regarding the distribution of the latent code $z$.

To derive the proposed model, the optimization problem in Equation (1) is rewritten as

$\max_\theta \max_\phi E_{p_{\mathcal{D}}(x)} E_{q_\phi(z|x)} (\log p(z) - \log q_\phi(z | x) + \log p_\theta(x | z)).$ (2)
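The step from Equation (1) to Equation (2) only uses the definition of the KL divergence as an expectation under $q_\phi(z|x)$. As a short worked derivation for a fixed $x$:

$-\text{KL}(q_\phi(z|x), p(z)) = -E_{q_\phi(z|x)}\left[\log q_\phi(z|x) - \log p(z)\right] = E_{q_\phi(z|x)}\left[\log p(z) - \log q_\phi(z|x)\right]$

$-\text{KL}(q_\phi(z|x), p(z)) + E_{q_\phi(z|x)} \log p_\theta(x|z) = E_{q_\phi(z|x)}\left[\log p(z) - \log q_\phi(z|x) + \log p_\theta(x|z)\right]$

Taking the expectation over $p_{\mathcal{D}}(x)$ on both sides gives the objective in Equation (2).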

If $q_\phi(z|x)$ is instead given by a black-box model, its density can no longer be evaluated in closed form, so Equation (2) cannot be optimized directly by gradient descent. Therefore, Mescheder et al. introduce a discriminator network to implicitly represent the term

$\log p(z) - \log q_\phi(z|x).$ (3)

At its optimum, the discriminator network $T(x, z)$ is supposed to attain the negative of Equation (3), that is, $T^\ast(x, z) = \log q_\phi(z|x) - \log p(z)$. To this end, they propose the following objective for $T(x, z)$:

$\max_T E_{p_{\mathcal{D}}(x)} E_{q_\phi(z|x)} \log \sigma(T(x, z)) + E_{p_{\mathcal{D}}(x)} E_{p(z)} \log(1 - \sigma(T(x,z)))$.

where $\sigma(t)$ denotes the sigmoid function. The idea is to let $T$ distinguish whether input-code pairs $(x, z)$ were sampled from $p_{\mathcal{D}}(x)p(z)$ or from $p_{\mathcal{D}}(x)q_\phi(z|x)$. Letting $T^\ast$ denote the optimal discriminator, Equation (2) can be rewritten as

$\max_{\theta, \phi} E_{p_{\mathcal{D}}(x)}E_{q_\phi(z|x)} (- T^\ast(x, z) + \log p_\theta(x | z))$.
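To make the discriminator objective more concrete, here is a minimal sketch in plain Python. It assumes a toy setting that is not from the paper: $x$ is held fixed, $q_\phi(z|x)$ is a one-dimensional Gaussian with known density, and $p(z)$ is a standard normal, so that the optimal discriminator $T^\ast(z) = \log q_\phi(z|x) - \log p(z)$ can be written down explicitly. All function names and constants are hypothetical. The sketch estimates the objective by Monte Carlo and compares $T^\ast$ against two other candidates.

```python
import math
import random


def log_sigmoid(t):
    # Numerically stable log(sigma(t)).
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def log_normal(z, mean, std):
    # Log-density of a one-dimensional Gaussian N(mean, std^2).
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((z - mean) / std) ** 2


def discriminator_objective(T, q_samples, p_samples):
    # Monte Carlo estimate of E_q[log sigma(T)] + E_p[log(1 - sigma(T))],
    # using the identity log(1 - sigma(t)) = log sigma(-t).
    term_q = sum(log_sigmoid(T(z)) for z in q_samples) / len(q_samples)
    term_p = sum(log_sigmoid(-T(z)) for z in p_samples) / len(p_samples)
    return term_q + term_p


# Toy assumption: for a fixed x, q(z|x) = N(1, 0.5^2) and p(z) = N(0, 1).
Q_MEAN, Q_STD = 1.0, 0.5


def t_star(z):
    # Closed-form optimal discriminator for this toy setting.
    return log_normal(z, Q_MEAN, Q_STD) - log_normal(z, 0.0, 1.0)


rng = random.Random(0)
q_samples = [rng.gauss(Q_MEAN, Q_STD) for _ in range(20000)]
p_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

obj_opt = discriminator_objective(t_star, q_samples, p_samples)
obj_shift = discriminator_objective(lambda z: t_star(z) + 1.0, q_samples, p_samples)
obj_zero = discriminator_objective(lambda z: 0.0, q_samples, p_samples)

print("T*      :", obj_opt)
print("T* + 1  :", obj_shift)
print("T = 0   :", obj_zero)
```

In this toy setting, $T^\ast$ should score higher than both the shifted discriminator and the uninformative $T = 0$. In the actual method, $T$ is a neural network over $(x, z)$ and the density of $q_\phi$ is never evaluated.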

To optimize this objective, the gradients with respect to $\theta$ and $\phi$ need to be computed. This is problematic as $T^\ast$ is only implicitly defined as the solution of an optimization problem that itself depends on $\phi$. However, Mescheder et al. show that taking the gradient of $T^\ast$ with respect to $\phi$ is not necessary, as

$E_{q_\phi(z|x)} ( \nabla_\phi T^\ast(x,z)) = 0$

(see the paper for proof). Then, the objective can be rewritten using the reparameterization trick [1]:

$\max_{\theta,\phi} E_{p_{\mathcal{D}}(x)}E_\epsilon ( -T^\ast(x, z_\phi(x, \epsilon)) + \log p_\theta(x | z_\phi(x, \epsilon)))$
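The reparameterization trick itself can be illustrated with a small sketch in plain Python. It assumes a simple Gaussian sampler $z_\phi(\epsilon) = \mu + \sigma \epsilon$ and the test function $f(z) = z^2$, neither of which comes from the paper; in the method above, $z_\phi(x, \epsilon)$ would be a neural network taking noise as input. The point is only that, once $z$ is a deterministic function of the parameters and the noise, the gradient can be moved inside the expectation over $\epsilon$.

```python
import random


def sample_z(mu, sigma, eps):
    # Reparameterized sample: a deterministic function of parameters and noise.
    return mu + sigma * eps


def reparam_grad_mu(mu, sigma, n_samples, rng):
    # Estimates d/dmu E[f(z)] for f(z) = z^2 as the average of
    # f'(z) * dz/dmu over noise samples, with f'(z) = 2z and dz/dmu = 1.
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = sample_z(mu, sigma, eps)
        total += 2.0 * z * 1.0
    return total / n_samples


rng = random.Random(0)
mu, sigma = 1.5, 0.7
estimate = reparam_grad_mu(mu, sigma, 50000, rng)

# Analytically, E[z^2] = mu^2 + sigma^2, so the exact gradient w.r.t. mu is 2 * mu.
exact = 2.0 * mu

print("estimate:", estimate)
print("exact   :", exact)
```

The Monte Carlo estimate should lie close to the analytic gradient; the same mechanism is what lets gradients with respect to $\phi$ flow through $z_\phi(x, \epsilon)$ in the objective above.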

In practice, the discriminator objective and the reparameterized objective above are optimized jointly, as in Algorithm 1.
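For intuition on why the gradient of $T^\ast$ with respect to $\phi$ drops out in expectation, the following is a short sketch, not the paper's full proof. It assumes that the optimal discriminator has the form $T^\ast(x, z) = \log q_\phi(z|x) - \log p(z)$ and that differentiation and integration can be interchanged. Only the first term depends on $\phi$, so

$E_{q_\phi(z|x)}\left(\nabla_\phi T^\ast(x, z)\right) = E_{q_\phi(z|x)}\left(\nabla_\phi \log q_\phi(z|x)\right) = \int q_\phi(z|x) \frac{\nabla_\phi q_\phi(z|x)}{q_\phi(z|x)} dz$

$= \nabla_\phi \int q_\phi(z|x) dz = \nabla_\phi 1 = 0$

Note that this argument relies on $T$ actually being the optimal discriminator.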

In the paper, they also discuss the case where $T$ fails to come sufficiently close to the optimal solution and, thus, hinders optimization. I refer to the paper for details.

Algorithm 1: Adversarial Variational Bayes in pseudo-code.

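[1] Diederik P. Kingma, Max Welling. *Auto-Encoding Variational Bayes*. arXiv preprint arXiv:1312.6114, 2013.

[2] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. *Generative Adversarial Nets*. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.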