P. Krähenbühl, V. Koltun. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NIPS, 2011.

Krähenbühl and Koltun introduce a popular algorithm for inference in dense random fields under the assumption of Gaussian edge potentials. In particular, the assumed model has the form

$E(x) = \sum_i \psi_u(x_i) + \sum_{i < j} \psi_p(x_i, x_j)$

with unary potentials $\psi_u$ and pairwise potentials $\psi_p$. Here, $x_i$ denotes the label assigned to pixel $i$. The unary potential is usually modeled by a pixel-wise classifier, while the pairwise potential is modeled using a mixture of Gaussian kernels multiplied by a label compatibility function:

$\psi_p(x_i,x_j) = \mu(x_i,x_j) \sum_{m = 1}^K w^{(m)} k^{(m)}(f_i, f_j)$

where $\mu$ is the label compatibility function, $w^{(m)}$ are the mixture weights, and $k^{(m)}$ is a Gaussian kernel comparing the feature vectors $f_i$ and $f_j$ of the corresponding pixels. In practice, the mixture is a weighted sum of Gaussian kernels applied to color and spatial coordinates, where the variances are parameters that need to be set or learned. A Potts model can be used as the compatibility function; however, Krähenbühl and Koltun also learn a symmetric compatibility function instead, which makes it possible to take interactions between labels into account.
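
To make the kernel mixture concrete, the following is a minimal NumPy sketch that builds the dense kernel matrix $\sum_m w^{(m)} k^{(m)}(f_i, f_j)$ for a tiny image. It assumes two kernels, one on position and color and one on position only; the function name `pairwise_kernel`, the parameter names `theta_alpha`, `theta_beta`, `theta_gamma` and all default values are illustrative choices, not values taken from the paper.

```python
import numpy as np

def pairwise_kernel(pos, color, w=(1.0, 1.0),
                    theta_alpha=3.0, theta_beta=0.1, theta_gamma=1.0):
    """Dense N x N matrix of sum_m w^(m) k^(m)(f_i, f_j).

    pos:   (N, 2) pixel coordinates
    color: (N, 3) pixel colors
    All bandwidths and weights are hypothetical defaults for illustration.
    """
    # Squared Euclidean distances between all pairs of pixels.
    d_pos = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=-1)
    d_col = ((color[:, None, :] - color[None, :, :]) ** 2).sum(axis=-1)
    # Kernel 1: nearby pixels with similar color (position and color).
    appearance = np.exp(-d_pos / (2 * theta_alpha**2) - d_col / (2 * theta_beta**2))
    # Kernel 2: nearby pixels regardless of color (position only).
    smoothness = np.exp(-d_pos / (2 * theta_gamma**2))
    return w[0] * appearance + w[1] * smoothness

# Tiny 2x2 "image": top row red, bottom row blue.
pos = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
color = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]], dtype=float)
K = pairwise_kernel(pos, color)
```

In this toy example, two adjacent pixels of the same color are coupled more strongly than two adjacent pixels of different color, which is the behavior the color term is meant to produce. Note that the dense matrix needs memory quadratic in the number of pixels, so this is only feasible for very small inputs.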

Based on this random field, their main contribution is an efficient message passing algorithm for approximate inference based on the mean field approximation. Following the mean field derivation of Koller and Friedman [1], the resulting update equation reads:

$Q_i(x_i = l) = \frac{1}{Z_i} \exp\left\{-\psi_u(x_i) - \sum_{l' \in \mathcal{L}} \mu(l, l') \sum_{m = 1}^K w^{(m)} \sum_{j \neq i} k^{(m)} (f_i, f_j)Q_j(l')\right\} \qquad (1)$

The derivation is based on choosing a fully factorized distribution $Q(x) = \prod_i Q_i(x_i)$ that minimizes the KL-divergence

$D(Q \| P) = \sum_{x}Q(x) \log \frac{Q(x)}{P(x)}$

(following Chapter 11.5 of Koller and Friedman [1] and then substituting the definition of the pairwise potentials). In Equation (1), $Z_i$ denotes the normalization constant of $Q_i$ and $\mathcal{L}$ is the set of labels. See the supplementary material of the paper for details. The resulting scheme is straightforward to implement, as shown in Algorithm 1.

initialize $Q$
for $t = 1,\ldots$
    for $m = 1,\ldots,K$
        $\hat{Q}_i^{(m)}(l) = \sum_{j \neq i} k^{(m)} (f_i, f_j)Q_j(l)$
    $\hat{Q}_i(x_i) = \sum_{l \in \mathcal{L}} \mu(x_i, l) \sum_m w^{(m)} \hat{Q}_i^{(m)}(l)$
    $Q_i(x_i) = \exp\{-\psi_u(x_i) - \hat{Q}_i(x_i)\}$
    normalize $Q_i(x_i)$

Algorithm 1: Mean field in fully connected CRFs.
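
As an illustration, Algorithm 1 can be sketched naively in NumPy as follows. This is not the authors' implementation: it stores each kernel as a dense matrix, so it is quadratic in the number of pixels. The function names, the fixed number of iterations and the initialization of $Q$ from the unary potentials are assumptions made for this sketch.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def mean_field(unary, kernels, weights, mu, n_iters=10):
    """Naive mean field following Algorithm 1.

    unary:   (N, L) unary energies psi_u
    kernels: list of K dense (N, N) kernel matrices k^(m)
    weights: list of K mixture weights w^(m)
    mu:      (L, L) label compatibility matrix
    """
    # Assumption: initialize Q from the unary potentials only.
    Q = softmax(-unary)
    for _ in range(n_iters):
        # Message passing: sum over j != i of k^(m)(f_i, f_j) Q_j(l).
        msg = np.zeros_like(Q)
        for w, k in zip(weights, kernels):
            k_off = k - np.diag(np.diag(k))  # drop the j = i term
            msg += w * (k_off @ Q)
        # Compatibility transform: pairwise[i, l] = sum_l' mu(l, l') msg[i, l'].
        pairwise = msg @ mu.T
        # Local update and normalization (all pixels updated in parallel).
        Q = softmax(-unary - pairwise)
    return Q

# Three fully connected pixels, two labels, Potts compatibility.
unary = np.array([[0.0, 5.0], [0.0, 5.0], [0.1, 0.0]])
kernels = [np.ones((3, 3))]
weights = [1.0]
mu = 1.0 - np.eye(2)
Q = mean_field(unary, kernels, weights, mu)
labels = Q.argmax(axis=1)
```

In this toy example, the third pixel slightly prefers the second label according to its unary potential, but the pairwise term pulls it towards the label of the two confident pixels.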

However, due to the message passing step, the complexity of a naive implementation is quadratic in the number of pixels. The main idea employed by Krähenbühl and Koltun is that the message passing step can be modeled by convolving $Q$ with a Gaussian (as $k^{(m)}$ corresponds to a Gaussian kernel):

$\hat{Q}_i^{(m)}(l) = \sum_{j \in V} k^{(m)} (f_i, f_j) Q_j(l) - Q_i(l) = [G_{\Lambda^{(m)}} \otimes Q(l)](f_i) - Q_i(l) \qquad (2)$

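Here, $V$ denotes the set of all pixels, and subtracting $Q_i(l)$ removes the term $j = i$, since $k^{(m)}(f_i, f_i) = 1$. By replacing the message passing step with Equation (2), the computational complexity can be reduced to linear in the number of pixels, based on the following observations: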

  • By the sampling theorem, the convolution can be performed on a downsampled version of $Q$. Gaussian filtering acts as a low-pass filter, which implies that the filtered signal can be reconstructed from samples whose spacing is proportional to the standard deviation of the filter.
  • When using a truncated Gaussian filter, only a constant number of neighbors are needed for the computation (as the spacing of the samples is proportional to the standard deviation).
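
The effect of truncation can be illustrated with a small NumPy sketch. It is a deliberate simplification and not the filtering scheme used in the paper: it assumes a 1D row of pixels and a purely spatial kernel, and it skips the downsampling step. The function names and the cutoff `n_sigmas` are hypothetical.

```python
import numpy as np

def message_naive(Q, sigma):
    """Exact message passing, O(N^2): sum over j != i of k(i, j) Q_j(l)."""
    idx = np.arange(Q.shape[0])
    K = np.exp(-(idx[:, None] - idx[None, :]) ** 2 / (2.0 * sigma**2))
    # k(i, i) = 1, so subtracting Q removes the j = i term as in Equation (2).
    return K @ Q - Q

def message_truncated(Q, sigma, n_sigmas=4):
    """Approximate message passing, O(N * r), with a truncated Gaussian filter."""
    r = int(np.ceil(n_sigmas * sigma))  # filter radius in pixels
    offsets = np.arange(-r, r + 1)
    g = np.exp(-offsets**2 / (2.0 * sigma**2))
    # Convolve each label channel separately; "same" zero-pads at the borders.
    out = np.stack(
        [np.convolve(Q[:, l], g, mode="same") for l in range(Q.shape[1])],
        axis=1,
    )
    return out - Q

# Random label distributions for 200 pixels and 3 labels.
rng = np.random.default_rng(0)
Q = rng.random((200, 3))
Q /= Q.sum(axis=1, keepdims=True)

exact = message_naive(Q, sigma=3.0)
approx = message_truncated(Q, sigma=3.0)
max_error = np.abs(exact - approx).max()
```

Each pixel only touches $2r + 1$ neighbors in the truncated version, so the cost per pixel is independent of the image size, while the discarded tail of the Gaussian contributes very little to the result.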

When the indicator function (Potts model) is used as label compatibility function, Krähenbühl and Koltun only learn the unary potentials, using JointBoost [2]. The remaining parameters, including the variances, are set by discrete grid search on a validation set. Unfortunately, they do not mention how to learn the weights of the Gaussian mixture used for the pairwise potentials.

  • [1] D. Koller, N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
  • [2] A. Torralba, K. P. Murphy, W. T. Freeman. Sharing visual features for multiclass and multiview object detection. PAMI, 2007.