21st JANUARY 2017

I. Goodfellow, Y. Bengio, A. Courville. *Deep Learning*. Chapter 19, MIT Press, 2016.


In Chapter 19, Goodfellow et al. discuss approximate inference as optimization. While concrete examples, especially regarding the deep models discussed in Chapter 20, are missing, the main idea behind approximate inference is discussed in more detail. As motivation, they illustrate why the posterior distribution, i.e. $p(h|v)$ where $v$ are visible and $h$ are hidden variables, is usually intractable in layered models. Figure 1 shows the discussed examples, corresponding to a semi-restricted Boltzmann machine on the left, a deep Boltzmann machine in the middle, and a directed model on the right. In all three cases the posterior is intractable due to direct or indirect interactions between the hidden variables.

Figure 1: Illustration of three graphical models as commonly used for deep learning. In all three cases, the direct or indirect interactions between hidden variables prevent the posterior from being tractable.

In order to approximate $p(h|v;\theta)$, with $\theta$ being the model parameters, approximate inference builds on the evidence lower bound $\mathcal{L}(v,\theta,q)$ on $\log p(v;\theta)$:

$\mathcal{L}(v,\theta,q) = \log p(v;\theta) - D_{KL}\left(q(h|v) \,\|\, p(h|v;\theta)\right)$

Here, $q$ is an arbitrary distribution over $h$ whose choice determines the tightness of the lower bound. Specifically, the Kullback-Leibler divergence is non-negative and zero exactly when $q(h|v)$ equals $p(h|v;\theta)$, in which case the lower bound becomes exact. The divergence is defined as:

$D_{KL}\left(q(h|v) \,\|\, p(h|v;\theta)\right) = E_{h \sim q}\left[\log\left(\frac{q(h|v)}{p(h|v;\theta)}\right)\right]$
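To make the bound concrete, here is a minimal numeric sketch in plain Python. The toy model is an assumption for illustration only: a single hidden variable with three states and a hypothetical table `joint` holding $p(h,v)$ for one fixed observation $v$. It evaluates the lower bound for a uniform $q$ and for the true posterior.

```python
import math

# Hypothetical toy model (not from the book): one fixed observation v and a
# hidden variable h with three states. joint[h] holds p(h, v) for that v.
joint = [0.10, 0.25, 0.05]

log_p_v = math.log(sum(joint))               # log p(v) = log sum_h p(h, v)
posterior = [p / sum(joint) for p in joint]  # p(h | v) = p(h, v) / p(v)


def kl(q, p):
    """D_KL(q || p) = E_{h ~ q}[log(q(h) / p(h))] for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def elbo(q):
    """Evidence lower bound: log p(v) - D_KL(q(h|v) || p(h|v))."""
    return log_p_v - kl(q, posterior)


q_uniform = [1 / 3, 1 / 3, 1 / 3]
print("log p(v):       ", log_p_v)
print("ELBO, uniform q:", elbo(q_uniform))   # strictly below log p(v)
print("ELBO, posterior:", elbo(posterior))   # matches log p(v)
```

In this toy setting the posterior can be computed exactly, so the sketch only illustrates the inequality; in the models of Figure 1 the posterior is exactly the quantity that is not available.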

Rewriting the evidence lower bound using the definition of the Kullback-Leibler divergence and using basic logarithmic identities gives:

$\mathcal{L}(v,\theta,q) = E_{h \sim q}[\log p(h,v;\theta)] + H(q)$

with $H(q)$ being the entropy of $q$. Inference can thus be stated as maximizing $\mathcal{L}(v,\theta,q)$ with respect to $q$. When restricting the family of distributions considered for $q$, $\mathcal{L}(v,\theta,q)$ may become tractable.
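For completeness, the rewriting step can be spelled out as follows. This is my own sketch of the intermediate steps, using only $p(h|v;\theta) = p(h,v;\theta)/p(v;\theta)$ and the fact that $\log p(v;\theta)$ does not depend on $h$. Inserting the definition of the Kullback-Leibler divergence gives:

$\mathcal{L}(v,\theta,q) = \log p(v;\theta) - E_{h \sim q}\left[\log q(h|v) - \log p(h|v;\theta)\right]$

Replacing $\log p(h|v;\theta)$ by $\log p(h,v;\theta) - \log p(v;\theta)$:

$\mathcal{L}(v,\theta,q) = \log p(v;\theta) - E_{h \sim q}[\log q(h|v)] + E_{h \sim q}[\log p(h,v;\theta)] - \log p(v;\theta)$

The two $\log p(v;\theta)$ terms cancel, and with $H(q) = -E_{h \sim q}[\log q(h|v)]$:

$\mathcal{L}(v,\theta,q) = E_{h \sim q}[\log p(h,v;\theta)] + H(q)$

Note that the final expression no longer involves the intractable posterior, only the joint distribution and $q$ itself.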

Variational inference means choosing $q$ from a restricted family of distributions. The mean field approximation requires $q$ to factorize as follows:

$q(h|v) = \prod_i q(h_i|v)$

In the discrete case, the distribution $q$ can be parameterized by vectors of probabilities, resulting in a rather straightforward optimization problem. In the continuous case, calculus of variations is applicable. Researchers derived a general fixed-point equation early on. Specifically, fixing all $q(h_j|v)$ for $j \neq i$, the optimal $q(h_i|v)$ is given by the normalized distribution corresponding to:

$\tilde{q}(h_i | v) = \exp\left(E_{h_{-i} \sim q(h_{-i}| v)} [\log \tilde{p}(v,h)]\right)$

This equation is frequently referred to in practice wherever the mean field approximation is used.
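As an illustration of how the fixed-point equation can be used, the following sketch applies it to a small, hypothetical model with three binary hidden units. The biases `b`, the couplings `W` and the form of $\log \tilde{p}(v,h)$ are assumptions of mine, with the fixed observation $v$ folded into the biases; this is not one of the examples from the book. The expectation over $h_{-i}$ is computed by brute-force enumeration, which is only feasible because the model is tiny.

```python
import itertools
import math

# Hypothetical toy model: three binary hidden units; the fixed observation v
# is assumed to be folded into the biases b.
b = [0.5, -1.0, 0.2]
W = [[0.0, 1.5, -0.7],
     [1.5, 0.0, 0.3],
     [-0.7, 0.3, 0.0]]  # symmetric couplings with zero diagonal
n = len(b)


def log_p_tilde(h):
    """Unnormalized log p~(v, h) = sum_i b_i h_i + sum_{i<j} W_ij h_i h_j."""
    pairwise = sum(W[i][j] * h[i] * h[j]
                   for i in range(n) for j in range(i + 1, n))
    return sum(b[i] * h[i] for i in range(n)) + pairwise


def expected_log_p_tilde(i, value, m):
    """E_{h_-i ~ q}[log p~(v, h)] with h_i clamped to `value`.

    m[j] is the current q(h_j = 1 | v) of the factorized distribution q.
    """
    others = [j for j in range(n) if j != i]
    total = 0.0
    for states in itertools.product([0, 1], repeat=len(others)):
        h = [0] * n
        h[i] = value
        weight = 1.0  # probability of this configuration of h_-i under q
        for j, s in zip(others, states):
            h[j] = s
            weight *= m[j] if s == 1 else 1.0 - m[j]
        total += weight * log_p_tilde(h)
    return total


def update(i, m):
    """Evaluate q~(h_i | v) for both states and normalize; returns q(h_i = 1 | v)."""
    q_tilde = [math.exp(expected_log_p_tilde(i, value, m)) for value in (0, 1)]
    return q_tilde[1] / (q_tilde[0] + q_tilde[1])


def mean_field(sweeps=50):
    """Cycle through the factors, updating one q(h_i | v) at a time."""
    m = [0.5] * n  # start from uniform factors
    for _ in range(sweeps):
        for i in range(n):
            m[i] = update(i, m)
    return m


m = mean_field()
print([round(x, 4) for x in m])
```

For this particular pairwise model, the update reduces to the familiar logistic form $q(h_i = 1|v) = \sigma(b_i + \sum_{j \neq i} W_{ij}\, q(h_j = 1|v))$, so the enumeration could be replaced by a closed-form expression; the generic version is kept here to stay close to the equation.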

Unfortunately, Goodfellow et al. discuss the discrete and continuous cases of the mean field approximation in a rather technical way, using two specific examples not related to the deep models described in Chapter 20 (personally, I see no benefit in having read the examples).