21st January 2017

I. Goodfellow, Y. Bengio, A. Courville. *Deep Learning*. Chapter 16, MIT Press, 2016.

In Chapter 16, Goodfellow et al. briefly recap directed and undirected graphical models, including d-separation, factor graphs and ancestral sampling. However, I find that other textbooks cover graphical models better: a similarly brief introduction can be found in [1], and an extensive discussion is available in [2].

In the end, they relate graphical models to deep learning, which yields some interesting insights. Traditional graphical models usually have few unobserved variables and sparse connections, so that exact inference is possible. The deep learning approach, in contrast, usually focuses on many hidden (latent) variables with dense connections in order to learn distributed representations. Exact inference is usually not expected to be possible, and even marginals are not tractable. It is usually sufficient to be able to draw approximate samples and to efficiently compute the gradient of the underlying energy function (while the energy itself does not need to be tractable).

Finally, they briefly introduce restricted Boltzmann machines (RBMs); more detailed discussions are available elsewhere. An RBM is an energy-based model with binary hidden and visible variables, $h$ and $v$, respectively:

$E(v,h) = -b^Tv - c^Th - v^TWh$

where $b$, $c$ and $W$ are real-valued parameters that are learned. Note that there is no interaction between any two hidden variables or any two visible variables: the terms $-b^T v$ and $-c^T h$ are linear in $v$ and $h$, and the energy contains no term that couples two visible or two hidden variables. Instead, only pairs of one visible and one hidden variable are connected, usually densely, through the weight matrix $W$. The individual conditional distributions are therefore easy to compute, for example:

$P(h_i = 1 | v) = \sigma(v^T W_{:,i} + c_i)$
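
As a sketch of where this conditional comes from, starting from the energy $E(v,h)$ above (writing $W_{:,i}$ for the $i$-th column and $W_{j,:}$ for the $j$-th row of $W$ is my own notation and an assumption, not necessarily the book's): every term of the energy involves at most one hidden unit, so $P(h | v)$ factorizes over the hidden units:

$P(h | v) \propto \exp(-E(v,h)) \propto \prod_i \exp\left(h_i (c_i + v^T W_{:,i})\right)$

Normalizing each factor over $h_i \in \{0, 1\}$ gives

$P(h_i = 1 | v) = \frac{\exp(c_i + v^T W_{:,i})}{1 + \exp(c_i + v^T W_{:,i})} = \sigma(c_i + v^T W_{:,i})$

and, since the energy is symmetric in the roles of $v$ and $h$, the same argument yields

$P(v_j = 1 | h) = \sigma(b_j + W_{j,:} h)$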

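Overall, this allows for efficient Gibbs sampling: because the conditionals factorize, all hidden units can be sampled at once given the visible units, and vice versa. Furthermore, the energy is linear in all of its parameters, so its derivatives are easy to derive.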
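
To make the two points above concrete, here is a minimal NumPy sketch of block Gibbs sampling in an RBM, together with the derivatives of the energy with respect to its parameters. It is an illustration under my own assumptions, not code from the book: the function names, the layer sizes, the random initialization and the number of steps are all hypothetical, and $c$ is taken to be the hidden bias, as in the energy function above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def energy(v, h, W, b, c):
    # E(v, h) = -b^T v - c^T h - v^T W h
    return -b @ v - c @ h - v @ W @ h


def p_h_given_v(v, W, c):
    # Vector of P(h_i = 1 | v) for all hidden units at once.
    return sigmoid(c + v @ W)


def p_v_given_h(h, W, b):
    # Vector of P(v_j = 1 | h) for all visible units at once.
    return sigmoid(b + W @ h)


def bernoulli(p, rng):
    # Draw independent binary samples with success probabilities p.
    return (rng.random(p.shape) < p).astype(float)


def gibbs_chain(v0, W, b, c, n_steps, rng):
    # Block Gibbs sampling: alternate between sampling all hidden
    # units given v and all visible units given h.
    v = v0.copy()
    h = bernoulli(p_h_given_v(v, W, c), rng)
    for _ in range(n_steps):
        v = bernoulli(p_v_given_h(h, W, b), rng)
        h = bernoulli(p_h_given_v(v, W, c), rng)
    return v, h


def energy_gradients(v, h):
    # The energy is linear in W, b and c, so the derivatives are simply
    # dE/dW = -v h^T, dE/db = -v, dE/dc = -h.
    return -np.outer(v, h), -v, -h


# Hypothetical toy setup: 6 visible and 4 hidden units.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

v0 = rng.integers(0, 2, size=n_visible).astype(float)
v, h = gibbs_chain(v0, W, b, c, n_steps=100, rng=rng)
grad_W, grad_b, grad_c = energy_gradients(v, h)
```

Note that the chain only yields approximate samples from the model distribution after sufficiently many steps; how many steps are needed depends on the parameters and is not addressed by this sketch.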

[1] C. M. Bishop. *Pattern Recognition and Machine Learning*. Springer, 2006.

[2] D. Koller, N. Friedman. *Probabilistic Graphical Models: Principles and Techniques*. MIT Press, 2009.