27^{th}SEPTEMBER2017

Philipp Krähenbühl, Vladlen Koltun. *Parameter Learning and Convergent Inference for Dense Random Fields*. ICML, 2013.

What is **your opinion** on the summarized work? Or do you know related work that is of interest? **Let me know** your thoughts in the comments below or get in touch with me:

Krähenbühl and Koltun built on their earlier work [] (find my reading notes here) where they introduced an efficient inference algorithm for dense random fields (i.e. fully connected CRFs) based on mean field inference. In particular, the model

$E(x|I,\theta) = \sum_i \psi_i(x_i|\theta) + \sum_{i < j} \psi_{ij}(x_i,x_j|\theta)$

with Gaussian mixture pairwise models

$\psi_{ij}(x_i,x_j|\theta) = \sum_{m = 1}^C \mu^{(m)}(x_i,x_j|\theta)k^{(m)}(f_i - f_j)$

and learned unary potentials. Here $x_i$ refers to the label corresponding to pixel $i$ while $f_i$ being to the corresponding feature vector (e.g. color and coordinates). The label compatibility function can be defined or learned. The mean field approximation minimizes (see []) the KL divergence

$D(Q|P) = \sum_{i,x_i} Q_i(x_i)\log Q_i(x_i)$

$+ \sum_{i,x_i}\psi_i(x_i|\theta)Q_i(x_i) + \sum_{i < j}\sum_{x_i,x_j} \psi_{ij}(x_i,x_j|\theta)Q_i(x_i)Q_j(x_j) + \log Z(I,\theta)$

based on approximating the distribution $P(x)=\frac{1}{Z(I,θ)} \exp(−E(x│I,\theta))$ with $Q(x)= \prod_i Q_i(x_i)$ to derive the basic message passing algorithm given by

$Q_i(x_i) = \frac{1}{Z_i} \exp\left(\psi_i(x_i|\theta) - \sum_{j \neq i}\sum_{x_j} \psi_{ij}(x_i,x_j|\theta)Q_j(x_j)\right)$.

As the sum in the above Equation requires $\mathcal{O}(N^2)$ operations (with $N$ being the number of pixels), Krähenbühl and Koltun introduce a simple convolution-based scheme to approximately compute the messages, running in linear time.

In this paper, Krähenbühl and Koltun further build on this scheme by presenting a variant that guarantees convergence. For this to work, they make the following two assumptions:

Under these assumptions, the objective of the KL divergence is rewritten in terms of the vector $q_{i + l} = Q_i(x_i = l) \in \mathbb{R}^{N\cdot M}$ where $M$ is the number of labels and $u_{i + l} = \psi_i(x_i = l)$:

$D(Q|P) = q^T \log q + q^T u + \frac{1}{2}q^T \Psi q + \log Z(I,\theta)$(1)

where the matrix $\Psi$ is defined as follows:

$\Psi = \sum_{m = 1}^C (K^{(m)} - I) \otimes \mu^{(m)}$(2)

Here, $K^{(m)}$ is the kernel matrix, $I$ is the identity matrix, $\mu^{(m)}$ is the label compatibility matrix mentioned previously. $\otimes$ refers to the Kronecker product. Note that $\Psi$ is of size $NM \times NM$ such that the dimensions in Equation (1) match! The idea is then to decompose Equation (1) into a sum of a convex function $f(q)$ and a concave function $g(q)$. Substituting Equation (2) into Equation (1) and splitting the subtraction in the sum in Equation (2) results in the following decomposition:

$f(q) = q^T \log q - \frac{1}{2} q^T \left(\sum_{m = 1}^C I \otimes \mu^{(m)}\right)q$

$g(q) = q^T U + \frac{1}{2}q^T\left(\sum_{m = 1}^C K^{(m)} \otimes \mu^{(m)}\right)q + \log Z(I,\theta)$

The argumentation is that $K^{(m)} \otimes \mu^{(m)}$ is negative semi-definite due to both assumptions and $g(q)$, thus, concave. Conversely, $−I \otimes \mu^{(m)}$ is positive definite and $f(q)$, thus, convex. They apply the convex-concave procedure (CCCP) [] to minimize the KL divergence.

Finally, Krähenbühl and Koltun also provide an algorithm for learning all the involved parameters. Unfortunately, the derivation is more than hard to follow such that I was only able to comprehend the derivation with some extra sheets of empty paper. For details see the paper.

Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NIPS, 2011.The Concave-Convex Procedure (CCCP). NIPS, 2001.