JANUARY 2017

READING

I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. Chapter 18, MIT Press, 2016.

In Chapter 18, after briefly introducing Monte Carlo methods in chapter 17 and graphical models in Chapter 16, Goodfellow et al. discuss the problem of computing the partition function. Most (undirected) graphical models, especially deep probabilistic models, are defined by an unnormalized probability distribution $\tilde{p}(x)$, often based on an energy. Computing the normalization constant $Z$, i.e. the partition function, is often intractable:

$Z = \int \tilde{p}(x) \, dx$

Another problem is that the partition function usually depends on the model parameters, i.e. $Z = Z(\theta)$ for parameters $\theta$. Thus, the gradient of the log-likelihood (needed in order to maximize the likelihood) decomposes into the so-called positive and negative phases:

$\nabla_\theta \log p(x;\theta) = \nabla_\theta \log \tilde{p}(x;\theta) - \nabla_\theta \log Z(\theta)$

Under certain regularity conditions (see the textbook for details), which can usually be assumed to hold for machine learning models, the gradient of the log partition function, $\nabla_\theta \log Z$, can be rewritten as

$\nabla_\theta \log Z = E_{x\sim p(x)}[\nabla_\theta \log \tilde{p}(x)]$

The derivation for the discrete case is rather straightforward: one computes the derivative of $\log Z$ and assumes that $p(x) > 0$ for all $x$. This identity is the basis for various Monte Carlo based methods for maximizing the likelihood. Also note that the two phases have intuitive interpretations: in the positive phase, $\log \tilde{p}(x)$ is increased for $x$ drawn from the data; in the negative phase, the partition function $Z$ is decreased by decreasing $\log \tilde{p}(x)$ for $x$ drawn from the model distribution.
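To sketch this derivation for the discrete case (the continuous case replaces the sum by an integral), one uses $\nabla_\theta \tilde{p}(x) = \tilde{p}(x) \nabla_\theta \log \tilde{p}(x)$, which is where $\tilde{p}(x) > 0$ is needed:

$\nabla_\theta \log Z = \frac{\nabla_\theta Z}{Z} = \frac{1}{Z} \sum_x \nabla_\theta \tilde{p}(x) = \sum_x \frac{\tilde{p}(x)}{Z} \nabla_\theta \log \tilde{p}(x) = E_{x \sim p(x)}[\nabla_\theta \log \tilde{p}(x)]$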

Following this brief introduction, Goodfellow et al. focus on the discussion of contrastive divergence. In general, contrastive divergence is based on a naive Markov Chain Monte Carlo algorithm for maximizing the likelihood: for the positive phase, samples from the data set are used to compute $\nabla_\theta \log \tilde{p}(x;\theta)$; for the negative phase, a Markov Chain is burned in to provide samples $\{\tilde{x}^{(1)},\ldots,\tilde{x}^{(M)}\}$ which are used to estimate $\nabla_\theta \log Z$ as

$\frac{1}{M} \sum_{i = 1}^M \nabla_\theta \log \tilde{p}(\tilde{x}^{(i)}; \theta)$

As burning in a randomly initialized Markov Chain in every iteration is rather computationally expensive, contrastive divergence initializes the Markov Chain using samples from the data set. This approach is summarized in Algorithm 1. Note that Gibbs Sampling is used (see gibbs_update in Algorithm 1; details can be found in the textbook, Chapter 17).

contrastive_divergence(
        $\epsilon$ // step size
        $k$ // number of Gibbs steps
    )
    while not converged
        sample a minibatch $\{x^{(1)},\ldots,x^{(m)}\}$ from the data set
        $g := \frac{1}{m} \sum_{i = 1}^m \nabla_\theta \log \tilde{p}(x^{(i)}; \theta)$
        for $i = 1,\ldots,m$
            $\tilde{x}^{(i)} := x^{(i)}$
        for $i = 1,\ldots,k$
            for $j = 1,\ldots,m$
                $\tilde{x}^{(j)} :=$ gibbs_update($\tilde{x}^{(j)}$)
        $g := g - \frac{1}{m} \sum_{i = 1}^m \nabla_\theta \log \tilde{p}(\tilde{x}^{(i)}; \theta)$
        $\theta := \theta + \epsilon g$
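To make the algorithm more concrete, the following is a minimal sketch of CD-$k$ for a binary restricted Boltzmann machine in Python/NumPy. The class, its methods and the toy data are illustrative assumptions for this summary and are not taken from the textbook; the Gibbs update exploits that both conditionals of a binary RBM are sigmoids.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class BinaryRBM:
    # Minimal binary-binary RBM trained with CD-k (illustrative sketch).
    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.RandomState(seed)
        self.W = 0.01 * self.rng.randn(n_visible, n_hidden)  # weights
        self.b = np.zeros(n_visible)                          # visible biases
        self.c = np.zeros(n_hidden)                           # hidden biases

    def sample_h(self, v):
        # p(h = 1 | v) and a binary sample from it.
        p = sigmoid(v @ self.W + self.c)
        return p, (self.rng.rand(*p.shape) < p).astype(float)

    def sample_v(self, h):
        # p(v = 1 | h) and a binary sample from it.
        p = sigmoid(h @ self.W.T + self.b)
        return p, (self.rng.rand(*p.shape) < p).astype(float)

    def cd_k_step(self, v_data, k=1, lr=0.01):
        m = v_data.shape[0]
        # Positive phase: expectations under the data.
        ph_data, _ = self.sample_h(v_data)
        # Negative phase: k Gibbs steps, initialized at the data (CD-k).
        v_model = v_data
        for _ in range(k):
            _, h = self.sample_h(v_model)
            _, v_model = self.sample_v(h)
        ph_model, _ = self.sample_h(v_model)
        # Gradient estimate of the log-likelihood: positive minus negative phase.
        grad_W = (v_data.T @ ph_data - v_model.T @ ph_model) / m
        grad_b = (v_data - v_model).mean(axis=0)
        grad_c = (ph_data - ph_model).mean(axis=0)
        # Gradient ascent, i.e. theta := theta + epsilon * g.
        self.W += lr * grad_W
        self.b += lr * grad_b
        self.c += lr * grad_c

# Usage on random binary data (placeholder for a real data set).
rbm = BinaryRBM(n_visible=16, n_hidden=8)
data = (np.random.RandomState(1).rand(64, 16) < 0.5).astype(float)
for _ in range(100):
    rbm.cd_k_step(data, k=1, lr=0.05)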

This results in a convenient approximation to the expensive negative phase. However, it also introduces the problem of spurious modes: regions with low probability under the data distribution but high probability under the model distribution. Unless many Gibbs steps are performed, a Markov Chain initialized with samples from the data set will usually not reach these spurious modes. Thus, the negative phase fails to suppress these regions, and samples drawn from the model may not resemble the data.

While contrastive divergence implicitly approximates the gradient of the log partition function (without providing an explicit estimate of the partition function itself), other methods try to avoid this estimation problem altogether. Goodfellow et al. discuss approaches including the pseudo-likelihood, score matching and noise-contrastive estimation. For example, the main idea behind the pseudo-likelihood is that the partition function is not needed when considering ratios of probabilities:

$\frac{p(x)}{p(y)} = \frac{\frac{1}{Z}\tilde{p}(x)}{\frac{1}{Z}\tilde{p}(y)} = \frac{\tilde{p}(x)}{\tilde{p}(y)}$

The same is applicable to conditional probabilities, since

$p(x_i | x_{-i}) = \frac{p(x)}{p(x_{-i})} = \frac{\tilde{p}(x)}{\sum_{x_i'} \tilde{p}(x_i', x_{-i})}$

does not involve the partition function either. These considerations result in the (log) pseudo-likelihood objective:

$\sum_{i = 1}^n \log p(x_i | x_{-i})$

where $x_{-i}$ denotes all variables except $x_i$. Goodfellow et al. note that maximizing the pseudo-likelihood is asymptotically consistent, i.e. it recovers the same parameters as maximum likelihood in the limit of infinite data. The remaining methods use similar ideas to avoid the partition function; details can be found in the chapter.
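As an illustration, the following is a minimal sketch of the log pseudo-likelihood for a fully visible Boltzmann machine over binary variables, where each conditional $p(x_i | x_{-i})$ reduces to a sigmoid of the remaining variables. The model choice and all names are illustrative assumptions for this summary, not taken from the chapter.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_pseudo_likelihood(x, W, b):
    # Log pseudo-likelihood of a binary vector x under a fully visible
    # Boltzmann machine with energy E(x) = -0.5 x^T W x - b^T x, where W is
    # symmetric with zero diagonal. The partition function never appears;
    # only the conditionals p(x_i | x_{-i}) are needed, which are sigmoids here.
    total = 0.0
    for i in range(len(x)):
        p_i = sigmoid(W[i] @ x + b[i])  # W[i, i] == 0, so x_i drops out
        total += np.log(p_i if x[i] == 1 else 1.0 - p_i)
    return total

# Usage on a toy model (parameters are random, not learned).
rng = np.random.RandomState(0)
A = rng.randn(5, 5)
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)
b = rng.randn(5)
x = (rng.rand(5) < 0.5).astype(float)
print(log_pseudo_likelihood(x, W, b))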

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.