

Chris J. Maddison, Andriy Mnih, Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. CoRR abs/1611.00712 (2016)

In parallel to the work by Jang et al. [1], Maddison et al. propose a continuous relaxation of discrete random variables that allows training variational auto-encoders with discrete latent codes using backpropagation. While both works share the same core idea, I highly recommend reading the paper by Maddison et al. - it is still challenging to read and the argumentation is not always easy to follow, but the appendix includes a very detailed discussion of practical considerations regarding the implementation of the proposed relaxation.


Considering a parameterization $\alpha = (\alpha_1,\ldots,\alpha_n)$ with $\alpha_k \in \mathbb{R}_+$ of a discrete distribution $D \sim Discrete(\alpha)$, the Gumbel-Max trick can be summarized as follows. First, uniform random variables $U_k \sim Uniform(0,1)$ are sampled; then

$k = \arg\max_i \left(\log \alpha_i - \log(-\log U_i)\right)$

and $D_k$ is set to one (the remaining $D_i$, $i \neq k$, are set to zero). This corresponds to

$p(D_k = 1) = \frac{\alpha_k}{\sum_{i = 1}^n \alpha_i}$

The name comes from the Gumbel distribution: $-\log(-\log U_k)$ is distributed according to a Gumbel distribution; see Wikipedia for a short overview of this distribution.
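To make the trick tangible, here is a minimal NumPy sketch (the function name `gumbel_max_sample` is my own, not from the paper) that empirically verifies the sampled index follows $\alpha_k / \sum_i \alpha_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])   # unnormalized positive weights

def gumbel_max_sample(alpha, rng):
    u = rng.uniform(size=alpha.shape)     # U_k ~ Uniform(0, 1)
    g = -np.log(-np.log(u))               # G_k = -log(-log U_k) ~ Gumbel(0, 1)
    return np.argmax(np.log(alpha) + g)   # index of the winning category

# Empirically, category k wins with probability alpha_k / sum_i alpha_i.
counts = np.bincount(
    [gumbel_max_sample(alpha, rng) for _ in range(100_000)], minlength=3)
print(counts / counts.sum())   # ≈ [1/6, 2/6, 3/6]
```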


The Gumbel-Max trick is not differentiable due to the $\arg\max$. Therefore, a temperature $\lambda \in \mathbb{R}_+$ is chosen and the $\arg\max$ is replaced by the softmax

$X_k = \frac{\exp(\frac{\log \alpha_k + G_k}{\lambda})}{\sum_{i = 1}^n \exp(\frac{\log \alpha_i + G_i}{\lambda})}$.

Here, $G_k \sim Gumbel$ and $X$ is distributed according to the proposed Concrete distribution, i.e. $X \sim Concrete(\alpha,\lambda)$. The corresponding density is

$p_{\alpha,\lambda}(x) = (n - 1)!\, \lambda^{n - 1} \prod_{k = 1}^n \left(\frac{\alpha_k x_k^{-\lambda - 1}}{\sum_{i = 1}^n \alpha_i x_i^{-\lambda}}\right)$.
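A quick NumPy sketch of drawing a Concrete sample via the softmax (the function name is my own): at low temperature the sample is nearly one-hot, at high temperature it becomes much smoother.

```python
import numpy as np

rng = np.random.default_rng(1)

def concrete_sample(alpha, lam, rng):
    # X = softmax((log alpha + G) / lambda) with G_k ~ Gumbel(0, 1)
    g = -np.log(-np.log(rng.uniform(size=alpha.shape)))
    z = (np.log(alpha) + g) / lam
    z -= z.max()                 # numerical stabilization before exponentiating
    e = np.exp(z)
    return e / e.sum()

alpha = np.array([1.0, 2.0, 3.0])
x_hot = concrete_sample(alpha, 0.1, rng)    # low temperature: close to one-hot
x_soft = concrete_sample(alpha, 5.0, rng)   # high temperature: much smoother
```

In both regimes the sample stays on the simplex, which is exactly what makes it a relaxation of the one-hot vector $D$.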


While the trick is easily explained, applying it in practice requires some care. In the framework of variational auto-encoders, the objective to optimize takes the form

$\mathcal{L}(\theta,a,\alpha) = \mathbb{E}_{Z \sim q_{\alpha, \lambda_1}(z|x)}\left[\log \frac{p_\theta(x|Z) p_{a,\lambda_2}(Z)}{q_{\alpha,\lambda_1}(Z|x)}\right]$ (1)

where $p_{a,\lambda_2}(z)$ is a Concrete density with probabilities $a$ and temperature $\lambda_2$, and $q_{\alpha,\lambda_1}(z|x)$ is the recognition model (usually implemented as a neural network) in the variational auto-encoder framework. The above equation already describes the relaxation, where both the recognition model and the prior on $Z$ are described using Concrete distributions.


For optimizing Equation (1), Maddison et al. work with Concrete random variables in log-space. Given $G_k \sim Gumbel$, $\alpha \in \mathbb{R}_+^n$ and $\lambda \in \mathbb{R}_+$:

$Y_k = \frac{\log \alpha_k + G_k}{\lambda} - \log\left(\sum_{i = 1}^n \exp\left(\frac{\log\alpha_i + G_i}{\lambda}\right)\right)$

such that $Y \in \mathbb{R}^n$ (i.e., continuous) and $\exp(Y) \sim Concrete(\alpha, \lambda)$. The resulting $ExpConcrete$ distribution has the log-density

$\log \rho_{\alpha,\lambda}(y) = \log((n - 1)!) + (n - 1)\log\lambda + \left(\sum_{k = 1}^n \log \alpha_k - \lambda y_k\right) - n \log\left( \sum_{k = 1}^n \exp(\log \alpha_k - \lambda y_k)\right)$

with $y \in \mathbb{R}^n$ such that $\sum_{k = 1}^n \exp(y_k) = 1$. Using the $ExpConcrete$ distribution, the objective changes to

$\mathbb{E}_{Y \sim \kappa_{\alpha,\lambda_1}(y|x)}\left[\log p_\theta(x|\exp(Y)) + \log \frac{\rho_{a,\lambda_2}(Y)}{\kappa_{\alpha, \lambda_1}(Y|x)}\right]$

where $\rho_{a,\lambda_2}(y)$ is the $ExpConcrete$ density corresponding to the prior and $\kappa_{\alpha,\lambda_1}(y|x)$ corresponds to the $ExpConcrete$ density of the recognition model in the variational auto-encoder framework. The $ExpConcrete$ random variables are then treated as the stochastic nodes in the variational auto-encoder and followed by an $\exp$ operation.
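The log-space sampling and the $ExpConcrete$ log-density can be sketched in NumPy as follows (helper names are mine, not from the paper; note that $\log((n-1)!)$ is computed via `math.lgamma(n)`):

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def logsumexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def exp_concrete_sample(alpha, lam, rng):
    # Y_k = (log alpha_k + G_k)/lambda - logsumexp_i((log alpha_i + G_i)/lambda)
    g = -np.log(-np.log(rng.uniform(size=alpha.shape)))
    z = (np.log(alpha) + g) / lam
    return z - logsumexp(z)

def exp_concrete_log_density(y, alpha, lam):
    # log rho(y) = log((n-1)!) + (n-1) log lambda
    #              + sum_k (log alpha_k - lambda y_k)
    #              - n logsumexp_k(log alpha_k - lambda y_k)
    n = len(alpha)
    t = np.log(alpha) - lam * y
    return math.lgamma(n) + (n - 1) * math.log(lam) + t.sum() - n * logsumexp(t)

alpha, lam = np.array([1.0, 2.0, 3.0]), 0.5
y = exp_concrete_sample(alpha, lam, rng)
print(np.exp(y).sum())   # ≈ 1.0: exp(Y) lies on the simplex
```

Working in log-space like this is exactly what makes the density evaluation numerically stable for small temperatures.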


The binary case is even simpler. The binary random variable is sampled as

$Y = \frac{\log \alpha + \log U - \log(1 - U)}{\lambda}$

with $U \sim Uniform(0,1)$, and the random variable is followed by a sigmoid operation. The corresponding log-density is

$\log \rho_{\alpha,\lambda}(y) = \log \lambda - \lambda y + \log \alpha - 2 \log(1 + \exp(-\lambda y + \log \alpha))$

where $\alpha, \lambda \in \mathbb{R}_+$.
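A small NumPy sketch of the binary case (function names are mine). One property that is easy to check empirically: $\mathrm{sigmoid}(Y) > 0.5$ exactly when $Y > 0$, which happens with probability $\alpha / (1 + \alpha)$ regardless of the temperature.

```python
import numpy as np

rng = np.random.default_rng(3)

def binary_concrete_sample(alpha, lam, rng):
    # Y = (log alpha + log U - log(1 - U)) / lambda, U ~ Uniform(0, 1)
    u = rng.uniform()
    return (np.log(alpha) + np.log(u) - np.log(1 - u)) / lam

def binary_concrete_log_density(y, alpha, lam):
    # log rho(y) = log lambda - lambda*y + log alpha
    #              - 2*log(1 + exp(-lambda*y + log alpha))
    return (np.log(lam) - lam * y + np.log(alpha)
            - 2 * np.log1p(np.exp(-lam * y + np.log(alpha))))

alpha, lam = 2.0, 0.5
ys = np.array([binary_concrete_sample(alpha, lam, rng) for _ in range(50_000)])
print((ys > 0).mean())   # ≈ alpha / (1 + alpha) = 2/3
```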


While in the original variational auto-encoder the KL divergence in the objective is computed analytically, this is not discussed by Maddison et al. As no other hint is given, I assume that the KL divergence is estimated using the same Monte Carlo approach used for $\mathbb{E}_{Y \sim \kappa_{\alpha,\lambda_1}(y|x)}\left[\log p_\theta (x|\exp(Y))\right]$ (e.g., using a single sample in practice).
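Under that assumption, a one-sample Monte Carlo estimate of the KL term could look as follows. This is only an illustrative sketch: the $ExpConcrete$ helpers, the parameter values, and the variable names are mine, not from the paper.

```python
import math
import numpy as np

rng = np.random.default_rng(4)

def logsumexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def exp_concrete_sample(alpha, lam, rng):
    g = -np.log(-np.log(rng.uniform(size=alpha.shape)))
    z = (np.log(alpha) + g) / lam
    return z - logsumexp(z)

def exp_concrete_log_density(y, alpha, lam):
    n = len(alpha)
    t = np.log(alpha) - lam * y
    return math.lgamma(n) + (n - 1) * math.log(lam) + t.sum() - n * logsumexp(t)

# Recognition model kappa_{alpha, lambda_1} and prior rho_{a, lambda_2}
# (illustrative parameter values).
alpha_post, lam1 = np.array([1.0, 4.0, 1.0]), 0.7
a_prior, lam2 = np.array([1.0, 1.0, 1.0]), 0.5

# Single-sample estimate of KL(kappa || rho): draw Y ~ kappa, take the log-ratio.
y = exp_concrete_sample(alpha_post, lam1, rng)
kl_estimate = (exp_concrete_log_density(y, alpha_post, lam1)
               - exp_concrete_log_density(y, a_prior, lam2))
```

Averaging this estimate over many samples converges to the true KL divergence; in training, a single sample per data point is typically enough because the gradient noise averages out over mini-batches.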

  • [1] E. Jang, S. Gu, B. Poole. Categorical Reparameterization with Gumbel-Softmax. ICLR, 2017.

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or get in touch with me: