COMPUTER VISION RESEARCH SCIENTIST


11th March 2017

Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther. *Autoencoding beyond pixels using a learned similarity metric*. ICML, 2016.


Larsen et al. combine variational auto-encoders [1] (VAE) with generative adversarial networks [2] (GAN). The goal is to use the feature representation learned by the GAN to improve the reconstruction objective of the VAE. The motivation is that the reconstruction error used for VAEs is usually a pixel-wise metric, which captures the perceptual similarity between images poorly.

A variational auto-encoder consists of two networks, an encoder and a decoder:

$z \sim \text{Enc}(x)=q(z|x)$, $\tilde{x}\sim \text{Dec}(z) = p(x|z)$

where $x$ denotes the input and $z$ the code corresponding to $x$. Note that $q(z|x)$ was originally introduced as the approximate posterior (the recognition model) used to optimize the variational lower bound. Therefore, the loss to minimize takes the form:

$L_{\text{VAE}} = \underbrace{-E_{q(z|x)}[\log p(x|z)]}_{L_{\text{llike}}} + D_{\text{KL}}(q(z|x) \| p(z))$ (1)

where the KL-divergence is either computed analytically (if possible), or both terms are approximated using a Monte-Carlo approach based on an auxiliary variable, see [1] for details.
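To make Equation (1) concrete, the following is a minimal sketch in plain Python. It assumes a diagonal Gaussian encoder $q(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$, a standard normal prior $p(z)$ and a unit-variance Gaussian observation model; these choices, as well as all function names and the `decode` callable, are illustrative assumptions and not taken from the paper's implementation.

```python
import math
import random


def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Analytic KL(q || p) for q = N(mu, diag(exp(log_var))) and p = N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_code(mu, log_var, rng):
    """Draw z = mu + sigma * eps with auxiliary noise eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def gaussian_nll(target, mean):
    """Negative log-density of target under N(mean, I)."""
    squared_error = sum((t - m) ** 2 for t, m in zip(target, mean))
    return 0.5 * squared_error + 0.5 * len(target) * math.log(2.0 * math.pi)


def vae_loss(x, mu, log_var, decode, rng, num_samples=1):
    """Monte Carlo estimate of L_llike plus the analytic KL term."""
    nll = 0.0
    for _ in range(num_samples):
        z = sample_code(mu, log_var, rng)
        nll += gaussian_nll(x, decode(z))  # decode is a hypothetical decoder mean
    return nll / num_samples + kl_diag_gaussian_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    x = [0.2, -0.1]
    # Toy linear "decoder" used only for illustration.
    print(vae_loss(x, mu=[0.1, 0.0], log_var=[-1.0, -1.0],
                   decode=lambda z: [0.5 * v for v in z], rng=rng, num_samples=8))
```

Here the KL term is computed analytically, while the reconstruction term is estimated from samples drawn through the auxiliary noise variable, mirroring the two options mentioned above.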

Generative adversarial networks consist of a generator network and a discriminator network. The discriminator tries to tell whether its input is a real sample $x$ or was generated by the generator from a drawn code $z \sim p(z)$. The generator tries to fool the discriminator. Therefore, the loss, which the discriminator maximizes and the generator minimizes, is given by

$L_\text{GAN} = \log(\text{Dis}(x)) + \log (1 - \text{Dis}(\text{Gen}(z)))$.
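As a small numeric illustration of this loss, the sketch below evaluates it for scalar discriminator outputs. It assumes $\text{Dis}(\cdot)$ returns a probability strictly between 0 and 1; the function name `gan_loss` is hypothetical.

```python
import math


def gan_loss(dis_real, dis_fake):
    """log(Dis(x)) + log(1 - Dis(Gen(z))) for discriminator outputs in (0, 1)."""
    return math.log(dis_real) + math.log(1.0 - dis_fake)


if __name__ == "__main__":
    # A confident, correct discriminator pushes the value towards 0 from below.
    print(gan_loss(dis_real=0.9, dis_fake=0.1))
    # An undecided discriminator (0.5 everywhere) yields 2 * log(0.5).
    print(gan_loss(dis_real=0.5, dis_fake=0.5))
    # A fooled discriminator (generator wins) yields a much lower value.
    print(gan_loss(dis_real=0.5, dis_fake=0.9))
```

The value increases as the discriminator separates real from generated inputs better and decreases as the generator fools it, which is why the two networks pull the same quantity in opposite directions.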

As the discriminator needs to learn appropriate representations of the input $x$, the idea of Larsen et al. is to use the features in intermediate layers in order to define a new metric used for the VAE. Specifically, they introduce a Gaussian observation model for the features of the $l$-th layer in the discriminator, referred to by $\text{Dis}_l(x)$:

$p(\text{Dis}_l(x)|z) = \mathcal{N}(\text{Dis}_l(x)|\text{Dis}_l(\tilde{x}), I)$.

Here, $\tilde{x}\sim \text{Dec}(z)$ is a sample from the decoder given the code $z$ of $x$. Overall, the first term in Equation (1) is replaced by

$L_{\text{llike}}^{\text{Dis}_l} = -E_{q(z|x)}[\log p(\text{Dis}_l(x)|z)]$
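Because the covariance of the observation model is the identity, the term inside the expectation reduces to a squared error in feature space plus a constant. Writing $d_l$ for the number of features in layer $l$ (a symbol introduced here only for illustration), the negative log-density of the Gaussian expands to:

$-\log p(\text{Dis}_l(x)|z) = \frac{1}{2}\|\text{Dis}_l(x) - \text{Dis}_l(\tilde{x})\|_2^2 + \frac{d_l}{2}\log(2\pi)$

The constant does not depend on any parameters, so minimizing $L_{\text{llike}}^{\text{Dis}_l}$ amounts to matching the discriminator's features of $x$ and $\tilde{x}$ instead of their pixels.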

The overall objective of the proposed VAE-GAN model is given by

$L = L_{\text{VAE}} + L_{\text{GAN}}$.

As the decoder in the VAE framework and the generator in the GAN framework model the same functionality, they share their parameters. The model is illustrated in Figure 1.

Figure 1: Illustration of the VAE-GAN model, a combination of variational auto-encoder and generative adversarial network.

In order to get the VAE-GAN model working in practice, Larsen et al. provide the following practical approaches to make training feasible:

$L_{\text{GAN}} = \log(\text{Dis}(x)) + \log(1 - \text{Dis}(\text{Dec}(z))) + \log(1 - \text{Dis}(\text{Dec}(\text{Enc}(x))))$
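The extended GAN loss above adds a third term, so that reconstructions $\text{Dec}(\text{Enc}(x))$ are treated as generated inputs alongside samples decoded from the prior. A minimal sketch, again assuming scalar discriminator outputs strictly between 0 and 1 and using hypothetical names:

```python
import math


def vae_gan_discriminator_loss(dis_real, dis_prior_sample, dis_reconstruction):
    """Three-term GAN loss: real x, Dec(z) with z ~ p(z), and Dec(Enc(x))."""
    return (math.log(dis_real)
            + math.log(1.0 - dis_prior_sample)
            + math.log(1.0 - dis_reconstruction))


if __name__ == "__main__":
    # The discriminator is rewarded for rejecting both kinds of generated input.
    print(vae_gan_discriminator_loss(0.9, 0.1, 0.2))
    # If reconstructions fool the discriminator, the value drops.
    print(vae_gan_discriminator_loss(0.9, 0.1, 0.8))
```

Compared to the two-term loss, the discriminator now also sees reconstructions of real inputs as negative examples.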

Larsen et al. provide experimental results on the CelebA faces dataset. For example, Figure 2 shows generative results comparing VAE, GAN and VAE-GAN.

Figure 2: Generative samples from the different models trained on the CelebA dataset.

- [1] Diederik P. Kingma, Max Welling. *Auto-encoding variational Bayes*. International Conference on Learning Representations, 2014.
- [2] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. *Generative adversarial nets*. CoRR, abs/1406.2661.