27^{th}FEBRUARY2018

Chris J. Maddison, Andriy Mnih, Yee Whye Teh. *The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables*. CoRR abs/1611.00712 (2016)

What is **your opinion** on the summarized work? Or do you know related work that is of interest? **Let me know** your thoughts in the comments below or get in touch with me:

In parallel to the work by Jang et al. [1], Maddison et al. propose a continuous relaxation of discrete variables allowing to train variational auto-encoders with discrete latent code using backpropagation. While the idea of both works is the same, I highly recommend reading the paper by Maddison et al. - it is still challenging to read and it is not always trivial to follow the argumentation, but the appendix includes a very detailed discussion of practical considerations regarding the implementation of the proposed relaxation.

Considering a parameterization $\alpha = (\alpha_1,\ldots,\alpha_n)$ with $\alpha_k \in \mathbb{R}_+$ of a discrete distribution $D \sim Discrete(\alpha)$, the Gumbel-Max trick can be sumarized as follows. First, uniform random variables $U_k \sim Uniform(0,1)$ are sampled, then

$k = \arg\max_i \log \alpha_k - \log(-\log U_k)$

and $D_k$ is set to one (the reamining $D_i$, $i \neq k$ are set to zero). This corresponds to

$p(D_k = 1) = \frac{\alpha_k}{\sum_{i = 1}^n \alpha_i}$

The name comes from the Gumbel distribution, i.e. $- \log(-\log U_k)$ is distributed according to a Gumbel distribution, see Wikipedia for a short overview of the distribution.

The Gumbel-trick is not differentiable due to the $\arg\max$. Therefore, a temperature $\lambda \in \mathbb{R}_+$ Is chosen and the $\arg\max$ is replaced by the softmax

$X_k = \frac{\exp(\frac{\log \alpha_k + G_k}{\lambda})}{\sum_{i = 1}^n \exp(\frac{\log \alpha_i + G_i}{\lambda})}$.

Here, $G_k \sim Gumbel$ and $X$ is distributed according to the proposed Concrete distribution, i.e. $X \sim Concrete(\alpha,\lambda)$. The corresponding density is

$p_{\alpha,\lambda}(x) = (n – 1)! \lambda^{n – 1} \prod_{k = 1}^n \left(\frac{\alpha_k x_k^{-\lambda – 1}}{\sum_{i = 1}^n \alpha_i x_i^{-\lambda}}\right)$.

While the trick is easily explained, the practice looks different. In the framework of variational auto-encoders, the objective to optimize takes the form

$\mathcal{L}(\theta,a,\alpha) = \mathbb{E}_{Z \sim q_{\alpha, \lambda_1}(z|x)}\left[\log \frac{p_\theta(x|Z) p_{\alpha,\lambda_2}(Z)}{q_{\alpha,\lambda_1}(Z|x)}\right]$ (1)

where $p_{\alpha,\lambda_2}(z)$ is a Concrete density with probabilities $a$ and temparature $\lambda_2$ and $q_{\alpha,\lambda_1}(Z|x)$ is the recognition model (usually implemented as neural network) in the variational auto-encoder framework. The above equation already describes the relaxation where the recognition model and the prior on $Z$ are described using Concrete distributions.

For minimizing Equation (1), Maddison et al. Work with Concrete random variables in log-space. Given $G_k \sim Gumbel$, $\alpha \in \mathbb{R}_+^n$ and $\lambda \in \mathbb{R}_+$:

$Y_k = \frac{\log \alpha_k + G_k}{\lambda} - \log\left(\sum_{i = 1}^n \exp(\frac{\log\alpha_i + G_i}{\lambda}\right)$

such that $Y \in \mathbb{R}^n$ (i.e. continuous) and $\exp(Y) \sim Concrete(\alpha, \lambda)$. The newly found $ExpConcrete$ distribution has the log-density

$\log \rho_{\alpha,\lambda}(y) = \log((n – 1)!) + (n – 1)\log\lambda + \left(\sum_{k = 1}^n \log \alpha_k - \lambda y_k\right) – n \log\left( \sum_{k = 1}^n \exp(\log \alpha_k - \lambda y_k)\right)$

with $y \in \mathbb{R}^n$ with $\sum_{k = 1}^n y_k = 1$. Using the $ExpConcrete$ distribution, the objective changes to

$\mathbb{E}_{Y \sim \kappa_{\alpha,\lambda_1}(y|x)}\left[\log p_\theta(x|\exp(Y)) + \log \frac{\rho_{a,\lambda_2}(Y)}{\kappa_{\alpha, \lambda_1}(Y|x)}\right]$

where $\rho_{a,\lambda_2}(y)$ is the $ExpConcrete$ density corresponding to the prior and $\kappa_{\alpha,\lambda_1}(y|x)$ corresponds to the $ExpConcrete$ density of the recognition model in the variational auto-encoder framework. Then, the $ExpConcrete$ random variables are treates as the stoachstic nodes in the variational auto-encoder an followed by and $\exp$ operation.

The binary case is even simpler. The binary random variable is sampled as

$Y = \frac{\log \alpha + \log U - \log(1 – U)}{\lambda}$

with $U \sim Uniform(0,1)$ and the random variable is followed by a Sigmoid operation. The corresponding log-density is

$\log \rho_{\alpha,\lambda} = \log \lambda - \lambda y + \log \alpha – 2 \log(1 + \exp(-\lambda y + \log \alpha))$

where $\alpha, \lambda \in \mathbb{R}_+$.

While in the orginal variational auto-encoder, the KL divergence of the objective is computed analytically, this is not discussed by Maddison et al. As no other clue is given, I assume that the KL divergence is estimated using the Monte Carlo approach also used for $\mathbb{E}_Y \sim \kappa_{\alpha,\lambda_1}(y|x)\left[\log p_\theta (x|\exp(Y))\right]$ (e.g. using one sample in practice).