Jang et al. introduce the Gumbel-Softmax distribution, which makes the reparameterization trick applicable to categorical distributions (e.g. Bernoulli distributions), as used in variational auto-encoders. Given a distribution $\pi = (\pi_1,\ldots,\pi_k)$ over classes $1,\ldots,k$, a categorical sample is assumed to be one-hot encoded, i.e. if the class is $i$, the vector $z \in \mathbb{R}^k$ is zero except for $z_i$, which is one. In order to apply the reparameterization trick, which has to be differentiable with respect to the input distribution, they first draw values $g_1, \ldots, g_k$ from the Gumbel distribution. The probability density function of a Gumbel distribution with location $\mu$ and scale $\beta$ has the form
$p(x; \mu, \beta) = \frac{1}{\beta}\exp\left(-\frac{x - \mu}{\beta} - \exp\left(-\frac{x - \mu}{\beta}\right)\right)$.
For $g_i \sim Gumbel(0,1)$, categorical samples can be drawn as follows:
$z = \text{one\_hot}\left(\arg\max_i (g_i + \log \pi_i)\right)$.
Note that $g_i \sim Gumbel(0,1)$ can be reparameterized as follows: $g_i = -\log(-\log(u_i))$ with $u_i \sim Uniform(0,1)$. In practice, the $\arg\max$ is approximated using the softmax function to sample vectors $y$:
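Putting the last two equations together, the Gumbel-max trick can be sketched in a few lines of plain Python (the names `sample_gumbel` and `gumbel_max_sample` are my own); the empirical class frequencies should come out close to $\pi$:

```python
import math
import random

def sample_gumbel():
    # Reparameterization: g = -log(-log(u)) with u ~ Uniform(0, 1)
    return -math.log(-math.log(random.random()))

def gumbel_max_sample(pi):
    # Gumbel-max trick: argmax_i (g_i + log pi_i) is a categorical sample
    scores = [sample_gumbel() + math.log(p) for p in pi]
    return max(range(len(pi)), key=scores.__getitem__)

random.seed(0)
pi = [0.1, 0.3, 0.6]
counts = [0] * len(pi)
for _ in range(20000):
    counts[gumbel_max_sample(pi)] += 1
freqs = [c / 20000 for c in counts]  # should be close to pi
```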
$y_i = \frac{\exp\left(\frac{\log \pi_i + g_i}{\tau}\right)}{\sum_{j = 1}^k \exp\left(\frac{\log \pi_j + g_j}{\tau}\right)}$
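Under the same assumptions as above, the relaxation itself can be sketched as follows (again plain Python, with the hypothetical name `gumbel_softmax_sample`): the output is a point on the probability simplex that becomes nearly one-hot as $\tau$ shrinks:

```python
import math
import random

def gumbel_softmax_sample(pi, tau):
    # y_i = softmax((log pi_i + g_i) / tau) with g_i ~ Gumbel(0, 1)
    gumbels = [-math.log(-math.log(random.random())) for _ in pi]
    logits = [(math.log(p) + g) / tau for p, g in zip(pi, gumbels)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(1)
pi = [0.2, 0.5, 0.3]
y_soft = gumbel_softmax_sample(pi, tau=1.0)   # smooth point on the simplex
y_hard = gumbel_softmax_sample(pi, tau=0.01)  # close to a one-hot vector
```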
The samples $y$ are then said to be drawn from the so-called Gumbel-Softmax distribution which, as $\tau \rightarrow 0$, approaches the categorical distribution defined by $\pi$. As the distribution is smooth for $\tau > 0$, it allows applying the reparameterization trick and drawing near-categorical samples from discrete distributions such as a Bernoulli distribution.
In experiments, they show that this technique allows training variational auto-encoders with a discrete latent code, such as several Bernoulli variables.