David Stutz

11th March 2017

Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Ole Winther. Autoencoding beyond pixels using a learned similarity metric. ICML, 2016.

Larsen et al. combine variational auto-encoders [1] (VAE) with generative adversarial networks [2] (GAN). The goal is to use the feature representation learned by the GAN to improve the reconstruction objective of the VAE. The motivation is that the reconstruction error used for VAEs is usually a pixel-wise metric, which captures the perceptual similarity between images poorly.

A variational auto-encoder consists of two networks, an encoder and a decoder:

$z \sim \text{Enc}(x)=q(z|x)$, $\tilde{x}\sim \text{Dec}(z) = p(x|z)$

where $x$ denotes the input and $z$ the code corresponding to $x$. Note that $q(z|x)$ originally plays the role of the approximate posterior used to optimize the variational lower bound. The loss to minimize is the negative of this lower bound and takes the form:

$L_{\text{VAE}} = \underbrace{-E_{q(z|x)}[\log p(x|z)]}_{L_{\text{llike}}} + D_{\text{KL}}(q(z|x)\|p(z))$ (1)

where the KL-divergence is either computed analytically (if possible), or both terms are approximated using a Monte-Carlo approach based on an auxiliary variable (the reparameterization trick), see [1] for details.
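To make Equation (1) concrete, here is a minimal sketch in plain Python. It assumes a diagonal Gaussian $q(z|x)$ with mean `mu` and log-variance `log_var`, a standard normal prior $p(z)$ and a Gaussian $p(x|z)$ with identity covariance; these choices, all function names and the toy "decoder" are my assumptions for illustration and are not taken from the paper.

```python
import math
import random

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Analytic D_KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps, where eps ~ N(0, I) is the auxiliary variable."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def gaussian_nll(x, x_rec):
    """-log N(x | x_rec, I), i.e. a pixel-wise squared error plus a constant."""
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_rec))
    return 0.5 * sq_err + 0.5 * len(x) * math.log(2.0 * math.pi)

def vae_loss(x, mu, log_var, decode, rng):
    """One-sample Monte Carlo estimate of L_VAE = L_llike + D_KL."""
    z = reparameterize(mu, log_var, rng)
    return gaussian_nll(x, decode(z)) + kl_diag_gaussian_to_standard_normal(mu, log_var)

# Toy usage: a hypothetical "decoder" that simply copies the code.
rng = random.Random(0)
x = [0.5, -1.0]
loss = vae_loss(x, mu=[0.0, 0.0], log_var=[0.0, 0.0], decode=lambda z: z, rng=rng)
```

In this setting the KL term is available in closed form, so only the reconstruction term needs to be sampled.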

Generative adversarial networks consist of a generator network and a discriminator network. The discriminator tries to tell whether its input is a real sample $x$ or was generated by the generator from a drawn code $z \sim p(z)$. The generator tries to fool the discriminator. Therefore, the objective, which the discriminator maximizes and the generator minimizes, is given by

$L_\text{GAN} = \log(\text{Dis}(x)) + \log (1 - \text{Dis}(\text{Gen}(z)))$.
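As a small numeric illustration of $L_{\text{GAN}}$, the following sketch evaluates the objective for a single real/generated pair. The function name and the example discriminator outputs are made up for illustration; the only assumption is that $\text{Dis}(\cdot)$ outputs a probability in $(0, 1)$ of the input being real.

```python
import math

def gan_objective(dis_real, dis_fake):
    """L_GAN = log(Dis(x)) + log(1 - Dis(Gen(z))) for one real/generated pair.

    Both arguments are discriminator outputs in (0, 1).
    """
    return math.log(dis_real) + math.log(1.0 - dis_fake)

# Discriminator separates real from generated well: objective close to 0.
confident = gan_objective(dis_real=0.9, dis_fake=0.1)
# Generator fools the discriminator: objective becomes much more negative.
fooled = gan_objective(dis_real=0.9, dis_fake=0.9)
# Discriminator cannot decide at all: objective equals 2 * log(0.5).
undecided = gan_objective(dis_real=0.5, dis_fake=0.5)
```

The objective is never positive; a good discriminator pushes it towards zero while a good generator pushes it down.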

As the discriminator needs to learn appropriate representations of the input $x$, the idea of Larsen et al. is to use the features in intermediate layers in order to define a new metric used for the VAE. Specifically, they introduce a Gaussian observation model for the features of the $l$-th layer in the discriminator, referred to by $\text{Dis}_l(x)$:

$p(\text{Dis}_l(x)|z) = \mathcal{N}(\text{Dis}_l(x)|\text{Dis}_l(\tilde{x}), I)$.

Here, $\tilde{x}\sim \text{Dec}(z)$ is a sample from the decoder for the code $z$ of $x$. Overall, the first term in Equation (1) is replaced by

$L_{\text{llike}}^{\text{Dis}_l} = -E_{q(z|x)}[\log p(\text{Dis}_l(x)|z)]$
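Because the observation model is Gaussian with identity covariance, the feature-wise term boils down to a squared error between discriminator features plus a constant. The sketch below illustrates this with a hypothetical stand-in for $\text{Dis}_l$ (sums of neighbouring pixels), chosen by me only so that a shifted pattern yields the same features; a real discriminator layer would of course be learned.

```python
import math

def feature_nll(feat_x, feat_x_rec):
    """-log N(Dis_l(x) | Dis_l(x_rec), I) for flattened feature vectors."""
    sq_err = sum((a - b) ** 2 for a, b in zip(feat_x, feat_x_rec))
    return 0.5 * sq_err + 0.5 * len(feat_x) * math.log(2.0 * math.pi)

def toy_dis_l(x):
    """Hypothetical stand-in for Dis_l: sums of neighbouring pixels."""
    return [x[i] + x[i + 1] for i in range(len(x) - 1)]

x = [0.0, 1.0, 0.0, 1.0]
x_rec = [1.0, 0.0, 1.0, 0.0]  # the same stripe pattern, shifted by one pixel

# Pixel-wise, every entry differs, so the squared error is 4.
pixel_sq_err = sum((a - b) ** 2 for a, b in zip(x, x_rec))
# In the toy feature space both inputs map to [1, 1, 1]; only the constant remains.
feature_loss = feature_nll(toy_dis_l(x), toy_dis_l(x_rec))
```

This is a toy case, but it shows the intended effect: a reconstruction that is far away pixel-wise can still be close in feature space.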

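The overall objective of the proposed VAE-GAN model, where $L_{\text{VAE}}$ uses $L_{\text{llike}}^{\text{Dis}_l}$ in place of the pixel-wise $L_{\text{llike}}$, is given by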

$L = L_{\text{VAE}} + L_{\text{GAN}}$.

As the decoder in the VAE framework and the generator in the GAN framework model the same functionality, they share their parameters. The model is illustrated in Figure 1.

To get the VAE-GAN model working in practice, Larsen et al. propose the following training details:

• The discriminator should not try to minimize $L_{\text{llike}}^{\text{Dis}_l}$; and the encoder should not be trained by backpropagating the errors from $L_{\text{GAN}}$.
• In practice, $L_{\text{GAN}}$ and $L_{\text{llike}}^{\text{Dis}_l}$ are weighted against each other by a parameter $\gamma$; this weighting is not applied to the complete model but only when performing the weight updates for the decoder.
• In addition to samples from $p(z)$, Larsen et al. also use samples from $q(z|x)$. Then, $L_{\text{GAN}}$ takes the form:

$L_{\text{GAN}} = \log(\text{Dis}(x)) + \log(1 - \text{Dis}(\text{Dec}(z))) + \log(1 - \text{Dis}(\text{Dec}(\text{Enc}(x))))$
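One way to read the three points above is as a separate quantity to minimize for each network. The sketch below spells this out for scalar loss values. It assumes the sign convention that the discriminator maximizes $L_{\text{GAN}}$ while the decoder minimizes it, writes `l_prior` for the KL term of Equation (1), and uses function and key names that are mine; signs may be written differently elsewhere, so treat it as an illustration and not as the authors' exact update rule.

```python
import math

def gan_objective_vaegan(dis_x, dis_dec_prior, dis_dec_enc):
    """Three-term L_GAN: real x, Dec(z) with z ~ p(z), and Dec(Enc(x)).

    All arguments are discriminator outputs in (0, 1).
    """
    return (math.log(dis_x)
            + math.log(1.0 - dis_dec_prior)
            + math.log(1.0 - dis_dec_enc))

def per_network_losses(l_prior, l_llike_dis, l_gan, gamma):
    """Quantity each network would minimize under the stated assumptions."""
    return {
        # Encoder: KL term and feature-wise reconstruction, no GAN signal.
        "encoder": l_prior + l_llike_dis,
        # Decoder: gamma trades off reconstruction against fooling the discriminator.
        "decoder": gamma * l_llike_dis + l_gan,
        # Discriminator: maximizes L_GAN and ignores the reconstruction term.
        "discriminator": -l_gan,
    }

l_gan = gan_objective_vaegan(dis_x=0.5, dis_dec_prior=0.5, dis_dec_enc=0.5)
losses = per_network_losses(l_prior=1.0, l_llike_dis=2.0, l_gan=-3.0, gamma=0.5)
```

In an actual implementation, each of these quantities would only be differentiated with respect to the parameters of its own network.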

Larsen et al. provide experimental results on the CelebA faces dataset. For example, Figure 1 shows generative results comparing VAE, GAN and VAE-GAN.

• [1] Diederik P. Kingma, Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.
• [2] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. Generative adversarial nets. CoRR, abs/1406.2661.
