Eric Jang, Shixiang Gu, Ben Poole. Categorical Reparameterization with Gumbel-Softmax. CoRR, 2016.

Jang et al. introduce the Gumbel-Softmax distribution, which allows applying the reparameterization trick to categorical distributions (with the Bernoulli distribution as two-class special case), as used e.g. in variational auto-encoders. Given a distribution $\pi = (\pi_1,\ldots,\pi_k)$ over classes $1,\ldots,k$, a categorical sample is assumed to be given in one-hot encoding, i.e. if the class is $i$, the vector $z \in \mathbb{R}^k$ is zero except for $z_i$, which is one. The reparameterization trick requires the sampling procedure to be differentiable with respect to the distribution's parameters; to this end, they first draw values $g_1, \ldots, g_k$ from the Gumbel distribution. The probability density function of a Gumbel distribution with location $\mu$ and scale $\beta$ has the form

$Gumbel(x; \mu, \beta) = \frac{1}{\beta}\exp\left(-\frac{x - \mu}{\beta} - \exp\left(-\frac{x - \mu}{\beta}\right)\right)$.

For $g_i \sim Gumbel(0,1)$ categorical samples can be drawn as follows:

$z = \text{one\_hot}\left(\arg\max_i \left(g_i + \log \pi_i\right)\right)$.
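
As a minimal sketch of this Gumbel-Max trick (assuming NumPy; `gumbel_max_sample` and the example probabilities are illustrative, not from the paper):

```python
import numpy as np

def gumbel_max_sample(log_pi):
    """Draw a one-hot categorical sample via the Gumbel-Max trick."""
    # g_i ~ Gumbel(0, 1), one independent draw per class.
    g = np.random.gumbel(loc=0.0, scale=1.0, size=log_pi.shape)
    # The argmax of g_i + log pi_i is distributed according to pi.
    z = np.zeros_like(log_pi)
    z[np.argmax(g + log_pi)] = 1.0
    return z

pi = np.array([0.1, 0.6, 0.3])
print(gumbel_max_sample(np.log(pi)))  # e.g. [0. 1. 0.]
```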

Note that $g_i \sim Gumbel(0,1)$ can itself be reparameterized as $g_i = -\log(-\log(u_i))$ with $u_i \sim Uniform(0,1)$. Since the $\arg\max$ is not differentiable, it is approximated in practice using the softmax function to sample vectors $y$:

$y_i = \frac{\exp\left(\frac{\log \pi_i + g_i}{\tau}\right)}{\sum_{j = 1}^k \exp\left(\frac{\log \pi_j + g_j}{\tau}\right)}$ for $i = 1, \ldots, k$, where $\tau$ is a temperature parameter.
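
A corresponding sketch of this relaxed sampling step (again NumPy, using the $-\log(-\log(u_i))$ reparameterization from above; function and variable names are my own):

```python
import numpy as np

def gumbel_softmax_sample(log_pi, tau):
    """Draw a relaxed (soft) categorical sample y from the Gumbel-Softmax."""
    u = np.random.uniform(size=log_pi.shape)
    g = -np.log(-np.log(u))  # g_i ~ Gumbel(0, 1)
    logits = (log_pi + g) / tau
    logits -= logits.max()  # for numerical stability of the softmax
    y = np.exp(logits)
    return y / y.sum()

pi = np.array([0.1, 0.6, 0.3])
print(gumbel_softmax_sample(np.log(pi), tau=0.5))  # soft vector summing to one
```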

The vectors $y$ then follow the so-called Gumbel-Softmax distribution which, for $\tau \to 0$, approaches the categorical distribution defined by $\pi$. As the distribution is smooth for $\tau > 0$, it allows applying the reparameterization trick to draw near-one-hot samples that approximate samples from discrete distributions such as a categorical or Bernoulli distribution.
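
As a small, self-contained illustration of the role of the temperature (my own toy example, not from the paper): the larger $\tau$, the more uniform the samples; the smaller $\tau$, the closer they get to one-hot vectors:

```python
import numpy as np

np.random.seed(0)
pi = np.array([0.1, 0.6, 0.3])
for tau in [5.0, 1.0, 0.1]:
    u = np.random.uniform(size=3)
    g = -np.log(-np.log(u))  # Gumbel(0, 1) noise
    y = np.exp((np.log(pi) + g) / tau)
    y /= y.sum()
    print(tau, np.round(y, 3))
# Large tau: y spreads mass over all classes; small tau: y is nearly one-hot.
```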

In experiments, they show that this technique allows training variational auto-encoders with a discrete latent code, for example consisting of several Bernoulli variables.

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: