"Deep Learning", Chapter 20, Goodfellow, Bengio, Courville • David Stutz

JANUARY2017

READING

I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. Chapter 20, MIT Press, 2016.

In Chapter 20, Goodfellow et al. discuss deep generative models. First, a short note on the reading order for Chapters 16 to 20. Chapter 16 can be skipped with basic knowledge on graphical models, or should be replaced by an introduction to graphical models such as [1]. For chapter 17, discussing Monte Carlo methods, I recommend falling back to lecture notes or other resources to get a more detailed introduction and a better understanding of these methods. In Chapter 18, the sections on the log-likelihood gradient as well as contrastive divergence are rather important and should be read before starting with Chapter 20. Similarly, Chapter 19 introduces several approximate inference mechanisms used (or built upon) in Chapter 20.

The discussion first covers (restricted) Boltzmann machines. As energy-based model, the joint probability distribution is described as

$P(x) = \frac{\exp(-E(x))}{Z}$

with

E(x) = -x^TUx - b^Tx

where $U$ is a weight matrix and $b$ a bias vector. The variables $x$ are all observed in this simple model. Obviously, Boltzmann machines become more interesting when introducing latent variables. Thus, given observable variables $v$ and latent variables $h$, the energy can be defined as

$E(v,h) = -v^T R v - v^TWh - h^T S h - b^Tv - c^Th$

where $R$, $W$ and $S$ are weight matrices describing the interactions between the variables and $b$ and $c$ are bias vectors. As the partition function of Boltzmann machines is intractable, learning is generally based on approaches approximating the log-likelihood gradient (e.g. contrastive divergence as discussed in Chapter 18).

In practice, Boltzmann machines become relevant when restricting the interactions between variables. This leads to restricted Boltzmann machines where any two visible/hidden variables are restricted to not interact with each others. As consequence, the matrix $R$ and $S$ vanish:

$E(v,h) = -b^Tv - c^Th -v^TWH$

An interesting observation, also essential for efficient learning of restricted Boltzmann machines, is that the conditional probabilities $P(h|v)$ and $P(v|h)$ are easy to compute. This follows from the following derivation:

$P(h|v) = \frac{P(h,v)}{P(v)}$

$=\frac{1}{P(v)}\frac{1}{Z}\exp\{b^Tv + c^Th + v^TWh\}$

$=\frac{1}{Z'}\exp\{c^Th + v^TWh\}$

$=\frac{1}{Z'}\exp\{\sum_{j} c_j h_j + \sum_{j}v^TW_{:j}h_j\}$

$=\frac{1}{Z'}\prod_{j}\exp\{\sum_{j} c_j h_j + v^TW_{:j}h_j\}$

Where the definition of the conditional probability and the energy E(v, h where substituted. As the variables h are binary, we can take the unnormalized distribution and directly normalize it to obtain:

$P(h_j = 1 | v) = \frac{\tilde{P}(h_j = 1|v)}{\tilde{P}(h_j = 0|v) + \tilde{P}(h_j = 1 | v}$

$=\frac{\exp\{c_j + v^T W_{:j}\}}{\exp\{0\} + \exp\{c_j + v^TW_{:j}\}}$

$=\sigma\left(c_j + v^T W_{:j}\right)$

Efficient evaluation and differentiation of the unnormalized distribution and efficient sampling make restricted Boltzmann machines trainable using algorithms such as contrastive divergence.

[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.

IAM

DAVIDSTUTZ

READING

SEARCHTHEBLOG

ARCHIVES

TAGS