Krähenbühl and Koltun build on their earlier work [1] (find my reading notes here) where they introduced an efficient mean field inference algorithm for dense random fields (i.e. fully connected CRFs). In particular, they consider the model
$E(x|I,\theta) = \sum_i \psi_i(x_i|\theta) + \sum_{i < j} \psi_{ij}(x_i,x_j|\theta)$
with Gaussian mixture pairwise potentials
$\psi_{ij}(x_i,x_j|\theta) = \sum_{m = 1}^C \mu^{(m)}(x_i,x_j|\theta)k^{(m)}(f_i - f_j)$
and learned unary potentials. Here, $x_i$ refers to the label of pixel $i$ and $f_i$ to the corresponding feature vector (e.g. color and coordinates). The label compatibility functions $\mu^{(m)}$ can be hand-defined or learned. The mean field approximation minimizes (see [1]) the KL divergence
$D(Q|P) = \sum_{i,x_i} Q_i(x_i)\log Q_i(x_i) + \sum_{i,x_i}\psi_i(x_i|\theta)Q_i(x_i)$
$+ \sum_{i < j}\sum_{x_i,x_j} \psi_{ij}(x_i,x_j|\theta)Q_i(x_i)Q_j(x_j) + \log Z(I,\theta)$
based on approximating the distribution $P(x) = \frac{1}{Z(I,\theta)} \exp(-E(x|I,\theta))$ with the fully factorized $Q(x) = \prod_i Q_i(x_i)$. This yields the basic message passing update
$Q_i(x_i) = \frac{1}{Z_i} \exp\left(-\psi_i(x_i|\theta) - \sum_{j \neq i}\sum_{x_j} \psi_{ij}(x_i,x_j|\theta)Q_j(x_j)\right)$.
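This fixed-point update can be sketched naively in NumPy. The toy example below is my own (random potentials, a Potts compatibility, and the energy-minimization sign convention), performing the quadratic-cost computation directly:

```python
import numpy as np

def mean_field_step(Q, unary, mu, K):
    """One naive mean field update for a fully connected CRF.

    Q:     (N, M) current marginals Q_i(x_i)
    unary: (N, M) unary potentials psi_i(x_i)
    mu:    (M, M) label compatibility matrix
    K:     (N, N) kernel matrix k(f_i - f_j) with zero diagonal (j != i)
    """
    # message_i(l) = sum_{j != i} K_ij * sum_{l'} mu(l, l') Q_j(l')
    messages = K @ Q @ mu.T
    # Q_i(x_i) is proportional to exp(-psi_i(x_i) - message_i(x_i))
    logits = -unary - messages
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    Q_new = np.exp(logits)
    return Q_new / Q_new.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, M = 50, 3
feats = rng.normal(size=(N, 2))                     # toy features per pixel
d2 = ((feats[:, None] - feats[None]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)                               # Gaussian kernel matrix
np.fill_diagonal(K, 0.0)                            # exclude the j = i term
mu = 1.0 - np.eye(M)                                # Potts compatibility
unary = rng.normal(size=(N, M))

Q = np.full((N, M), 1.0 / M)                        # uniform initialization
for _ in range(20):
    Q = mean_field_step(Q, unary, mu, K)
```

Each iteration costs $\mathcal{O}(N^2 M)$ because of the dense kernel matrix, which is exactly the bottleneck addressed next.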
As the sum in this update requires $\mathcal{O}(N^2)$ operations (with $N$ being the number of pixels), Krähenbühl and Koltun introduced a simple convolution-based scheme that computes the messages approximately in linear time.
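The idea can be illustrated for the special case of a purely spatial Gaussian kernel on a regular image grid, where the sum over $j$ becomes a Gaussian blur of the marginals. The sketch below is my own, using SciPy's `gaussian_filter`; the paper itself uses permutohedral lattice filtering to handle general feature spaces:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def approx_messages(Q, sigma=2.0):
    """Approximate m_i(l) = sum_{j != i} k(p_i - p_j) Q_j(l) for a purely
    spatial Gaussian kernel k by blurring each label channel of Q.

    Q: (H, W, M) marginals on an image grid. Runs in time linear in the
    number of pixels. Subtracting Q to remove the j = i term is only
    approximate, since the discrete filter is normalized.
    """
    blurred = np.stack(
        [gaussian_filter(Q[..., l], sigma) for l in range(Q.shape[-1])],
        axis=-1,
    )
    return blurred - Q

# sanity check: for uniform marginals the messages vanish (up to float error)
Q = np.full((16, 16, 3), 1.0 / 3.0)
m = approx_messages(Q)
```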
In this paper, Krähenbühl and Koltun further build on this scheme by presenting a variant that guarantees convergence. For this to work, they make the following two assumptions:
The kernels $k^{(m)}$ are positive definite (i.e. have positive definite kernel matrices $K^{(m)}$);
And the label compatibility functions $\mu^{(m)}$ are negative semi-definite (i.e. there exists a $c$ such that the matrix with entries $\mu_{l,l'}^{(m)} = \mu^{(m)}(l,l') + c$ is negative semi-definite).
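Both assumptions are easy to check numerically for concrete choices; the toy check below is my own. A Gaussian kernel matrix on distinct points is positive definite, and the Potts compatibility $\mu(l,l') = [l \neq l']$ becomes negative semi-definite after shifting all entries by $c = -1$:

```python
import numpy as np

rng = np.random.default_rng(1)
feats = rng.normal(size=(30, 5))
d2 = ((feats[:, None] - feats[None]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)              # Gaussian kernel matrix

# Assumption 1: K is positive definite
eig_K = np.linalg.eigvalsh(K)
assert eig_K.min() > 0

# Assumption 2: the Potts compatibility shifted by c = -1 equals -I,
# which is negative semi-definite
M = 4
mu = 1.0 - np.eye(M)               # Potts: 0 on, 1 off the diagonal
shifted = mu + (-1.0)              # add c to every entry
eig_mu = np.linalg.eigvalsh(shifted)
assert eig_mu.max() <= 1e-12
```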
Under these assumptions, the KL divergence objective is rewritten in terms of the vectors $q, u \in \mathbb{R}^{N\cdot M}$ with entries $q_{i + l} = Q_i(x_i = l)$ and $u_{i + l} = \psi_i(x_i = l)$, where $M$ is the number of labels:
$D(Q|P) = q^T \log q + q^T u + \frac{1}{2}q^T \Psi q + \log Z(I,\theta)$ (1)
where the matrix $\Psi$ is defined as follows:
$\Psi = \sum_{m = 1}^C (K^{(m)} - I) \otimes \mu^{(m)}$ (2)
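As a sanity check (my own, not from the paper), the quadratic form $\frac{1}{2} q^T \Psi q$ can be compared against the brute-force pairwise term of the KL divergence for a single Gaussian kernel with Potts compatibility:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 8, 3
feats = rng.normal(size=(N, 2))
d2 = ((feats[:, None] - feats[None]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)              # kernel matrix, K_ii = k(0) = 1
mu = 1.0 - np.eye(M)               # Potts compatibility

Q = rng.random((N, M))
Q /= Q.sum(axis=1, keepdims=True)  # random valid marginals
q = Q.reshape(-1)                  # flattened so that q[i*M + l] = Q_i(x_i = l)

Psi = np.kron(K - np.eye(N), mu)   # Equation (2) with a single kernel (C = 1)
quad = 0.5 * q @ Psi @ q

# brute force: sum_{i < j} sum_{l, l'} K_ij mu(l, l') Q_i(l) Q_j(l')
brute = sum(
    K[i, j] * Q[i] @ mu @ Q[j]
    for i in range(N) for j in range(i + 1, N)
)
assert np.isclose(quad, brute)
```

Note that subtracting $I$ from $K^{(m)}$ zeros the diagonal of the kernel matrix (since $k^{(m)}(0) = 1$), which is what removes the $i = j$ terms.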
Here, $K^{(m)}$ is the kernel matrix, $I$ is the identity matrix, and $\mu^{(m)}$ is the label compatibility matrix mentioned previously; $\otimes$ denotes the Kronecker product. Note that $\Psi$ has size $NM \times NM$, so the dimensions in Equation (1) match. The idea is then to decompose Equation (1) into a sum of a convex function $f(q)$ and a concave function $g(q)$. Substituting Equation (2) into Equation (1) and splitting the difference $K^{(m)} - I$ results in the decomposition
$f(q) = q^T \log q - \frac{1}{2} q^T \left(\sum_{m = 1}^C I \otimes \mu^{(m)}\right)q$
$g(q) = q^T u + \frac{1}{2}q^T\left(\sum_{m = 1}^C K^{(m)} \otimes \mu^{(m)}\right)q + \log Z(I,\theta)$
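The decomposition can also be verified numerically on a small example (my own): for a positive definite $K^{(m)}$ and a negative semi-definite $\mu^{(m)}$, the quadratic part of $g$ is negative semi-definite and that of $f$ positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 12, 3
feats = rng.normal(size=(N, 2))
d2 = ((feats[:, None] - feats[None]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)            # positive definite kernel matrix
mu = -np.eye(M)                  # negative semi-definite compatibility (shifted Potts)
I = np.eye(N)

Psi = np.kron(K - I, mu)         # Equation (2) with a single kernel
assert Psi.shape == (N * M, N * M)

# quadratic part of g: K (x) mu is negative semi-definite ...
assert np.linalg.eigvalsh(np.kron(K, mu)).max() <= 1e-10
# ... and quadratic part of f: -I (x) mu is positive semi-definite
assert np.linalg.eigvalsh(np.kron(-I, mu)).min() >= -1e-10
```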
The argument is that $K^{(m)} \otimes \mu^{(m)}$ is negative semi-definite by the two assumptions, so $g(q)$ is concave. Conversely, $-I \otimes \mu^{(m)}$ is positive semi-definite, so that together with the convex entropy term $q^T \log q$, $f(q)$ is convex. The concave-convex procedure (CCCP) [2] can then be applied to minimize the KL divergence.
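CCCP minimizes a sum $f + g$ of a convex and a concave function by repeatedly linearizing $g$ at the current iterate and minimizing the resulting convex upper bound. A one-dimensional toy sketch (my own choice of $f$ and $g$, not the CRF objective):

```python
import numpy as np

# Toy objective h(x) = f(x) + g(x) with
#   f(x) = exp(x) + exp(-x)   (convex)
#   g(x) = -x**2 / 2          (concave)
# A CCCP step minimizes f(x) + g'(x_t) * x, i.e. solves f'(x) = -g'(x_t):
#   exp(x) - exp(-x) = x_t  <=>  x = arcsinh(x_t / 2)
x = 3.0
for _ in range(50):
    x = np.arcsinh(x / 2.0)
# the iterates decrease h monotonically and converge to the minimizer x* = 0
```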
Finally, Krähenbühl and Koltun also provide an algorithm for learning all involved parameters. Unfortunately, the derivation is hard to follow; I was only able to work through it with a few extra sheets of paper. For details, see the paper.
[1] Philipp Krähenbühl, Vladlen Koltun. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NIPS, 2011.
[2] Alan L. Yuille, Anand Rangarajan. The Concave-Convex Procedure (CCCP). NIPS, 2001.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.