J. Mairal, F. Bach, J. Ponce. Sparse Modeling for Image and Vision Processing. Foundations and Trends in Computer Graphics and Vision, volume 8, 2014.

Mairal, Bach and Ponce provide a thorough discussion of sparse coding for computer vision. Besides discussing data pre-processing and data analysis methods such as PCA and ICA, they also provide detailed descriptions of sparse coding and dictionary learning algorithms as well as their applications, such as denoising, inpainting, super-resolution and recognition.

Personally, I was most interested in the data pre-processing as well as the gradient descent-based sparse coding and dictionary learning approaches.

Data pre-processing. Mairal, Bach and Ponce discuss the following pre-processing techniques: centering, normalization and whitening.

  • Centering: Assuming image patches $x_1,\ldots,x_N \in \mathbb{R}^n$, centering is computed as

    $x_i := x_i - \frac{1}{N}\sum_{j = 1}^N x_j$

    When performed on overlapping patches, every pixel occurs in multiple patches; therefore, the corresponding values are averaged. According to Mairal, Bach and Ponce, this has a similar effect as high-pass filtering the image.
  • Contrast normalization: Interestingly, naive $l_2$-normalization, i.e.

    $x_i := \frac{x_i}{\|x_i\|_2}$

    may give poor results when applied to image patches. Especially after centering, some patches may contain little information, resulting in a low $l_2$ norm. Therefore, Mairal, Bach and Ponce use

    $x_i := \frac{x_i}{\max(\eta, \|x_i\|_2)}$

    with $\eta$ chosen as $0.2$ times the mean $l_2$ norm across all patches. On overlapping patches, the same approach as above is employed.
  • Whitening: Given the covariance of the image patches, i.e.

    $\Sigma = \frac{1}{N}\sum_{i = 1}^N x_i x_i^T$,

    whitening intends to transform the image patches such that the covariance matrix is close to the identity matrix. As the covariance is positive semi-definite, the eigenvalues are non-negative and $S = \text{diag}(s_1,\ldots, s_n)$ of the eigenvalue decomposition

    $\Sigma = U S^2 U^T$

    holds the singular values (note that $U$ is orthogonal). Whitening is then performed as

    $x_i := U S' U^T x_i$

    with $S' = \text{diag}(s_1', \ldots, s_n')$ and

    $s_i' = \begin{cases}\frac{1}{s_i} &\quad \text{ if } |s_i| > \epsilon\\ 0 &\quad \text{ otherwise }\end{cases}$

    for some $\epsilon > 0$.
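The three pre-processing steps can be sketched in NumPy as follows; this is my own illustrative implementation, not code from the monograph, and the parameter defaults (`eta_factor=0.2` following the text, `eps`) are assumptions:

```python
import numpy as np

def preprocess(X, eta_factor=0.2, eps=1e-8):
    """Center, contrast-normalize and whiten N patches stored as rows of X."""
    # Centering: subtract the mean patch.
    X = X - X.mean(axis=0)
    # Contrast normalization: divide by max(eta, ||x_i||_2),
    # with eta = 0.2 times the mean l2 norm across all patches.
    norms = np.linalg.norm(X, axis=1)
    eta = eta_factor * norms.mean()
    X = X / np.maximum(eta, norms)[:, None]
    # Whitening: Sigma = U S^2 U^T; rescale by 1/s_i where s_i > eps.
    Sigma = X.T @ X / X.shape[0]
    s2, U = np.linalg.eigh(Sigma)       # eigenvalues s_i^2 of Sigma
    s = np.sqrt(np.maximum(s2, 0.0))    # singular values s_i
    s_inv = np.where(s > eps, 1.0 / np.maximum(s, eps), 0.0)
    W = U @ np.diag(s_inv) @ U.T        # whitening matrix U S' U^T
    return X @ W.T
```

After this transform, the empirical covariance of the patches is (up to the truncated directions with $s_i \leq \epsilon$) the identity matrix.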

For applying the discussed pre-processing techniques to color patches, Mairal, Bach and Ponce make the following suggestions:

  • The chosen color space inherently influences pre-processing; one of the motivations for considering color spaces such as Lab, XYZ and YCbCr is the fact that the Euclidean distance is not a good estimate of perceived color differences when applied in RGB.
  • Different channels of color images should be centered separately.

Sparse coding. Mairal, Bach and Ponce discuss several sparse coding approaches with respect to both the $l_0$ norm and the $l_1$ norm, i.e. considering the problems

$\min_{\alpha \in \mathbb{R}^m} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_0$ (1)

$\min_{\alpha \in \mathbb{R}^m} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1$ (2)

I was most interested in gradient descent approaches to both problems (however, they also discuss approaches such as matching pursuit, orthogonal matching pursuit etc.). For Equation (1) this results in hard thresholding as illustrated in Algorithm 1; for Equation (2), soft thresholding is illustrated in Algorithm 2.

function hard_thresholding(
        $x$, // signal to sparse code
        $D$, // dictionary
        $\eta$, $\tau$) // parameters
    initialize $\alpha^{(0)}$
    for $t = 0,1,\ldots$
        $\alpha^{(t + 1)} := \alpha^{(t)} + \eta D^T(x - D\alpha^{(t)})$
        for $i = 1,\ldots,m$
            $\alpha_i^{(t + 1)} := \begin{cases}\alpha_i^{(t + 1)}&\quad\text{ in case } |\alpha_i^{(t + 1)}| \geq \tau\\0&\quad\text{ otherwise }\end{cases}$

Algorithm 1: Hard thresholding for sparse coding.
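Algorithm 1 can be sketched directly in NumPy; this is my own minimal implementation, with the step size $\eta$, the threshold $\tau$ and the iteration count chosen arbitrarily rather than taken from the monograph:

```python
import numpy as np

def hard_thresholding(x, D, eta=0.1, tau=0.1, iterations=100):
    """Iterative hard thresholding for min ||x - D alpha||_2^2 + lambda ||alpha||_0."""
    alpha = np.zeros(D.shape[1])
    for _ in range(iterations):
        # Gradient step on the quadratic data term.
        alpha = alpha + eta * D.T @ (x - D @ alpha)
        # Hard threshold: zero out all coefficients with |alpha_i| < tau.
        alpha[np.abs(alpha) < tau] = 0.0
    return alpha
```

For an orthonormal dictionary (e.g. the identity), a single gradient step with $\eta = 1$ already lands on the least-squares solution, so the iteration reduces to thresholding the coefficients of $x$.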

function soft_thresholding(
        $x$, // signal to sparse code
        $D$, // dictionary
        $\eta$, $\lambda$) // parameters
    initialize $\alpha^{(0)}$
    for $t = 0,1,\ldots$
        $\alpha^{(t + 1)} := \alpha^{(t)} + \eta D^T(x - D\alpha^{(t)})$
        for $i = 1,\ldots,m$
            $\alpha_i^{(t + 1)} := \text{sign}(\alpha_i^{(t+1)})\max(|\alpha_i^{(t + 1)}| - \lambda, 0)$

Algorithm 2: Soft thresholding for sparse coding.
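Algorithm 2 translates to NumPy just as directly; again this is my own sketch with assumed parameters. Note that the pseudocode shrinks by $\lambda$ itself (the classical ISTA iteration would shrink by $\eta\lambda$); I follow the pseudocode here:

```python
import numpy as np

def soft_thresholding(x, D, eta=0.1, lam=0.1, iterations=100):
    """Soft thresholding for min ||x - D alpha||_2^2 + lambda ||alpha||_1."""
    alpha = np.zeros(D.shape[1])
    for _ in range(iterations):
        # Gradient step on the quadratic data term.
        alpha = alpha + eta * D.T @ (x - D @ alpha)
        # Soft threshold: shrink every coefficient towards zero by lam.
        alpha = np.sign(alpha) * np.maximum(np.abs(alpha) - lam, 0.0)
    return alpha
```

In contrast to hard thresholding, surviving coefficients are biased towards zero by the shrinkage, which is the price of the convex $l_1$ relaxation.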

Dictionary learning. For dictionary learning, Mairal, Bach and Ponce discuss several approaches including K-SVD, block coordinate descent and alternating minimization. Again, I focused on a (stochastic) gradient descent approach addressing the problem

$\min_{D \in \mathbb{R}^{n \times m}} \frac{1}{N}\sum_{i = 1}^N \min_{\alpha \in \mathbb{R}^m} \frac{1}{2} \| x_i - D \alpha\|_2^2 + \lambda \|\alpha\|_1$

Alternatively, the $l_1$ norm can be replaced by the $l_0$ norm. The resulting algorithm is shown in Algorithm 3, where the last step in each iteration describes a projection of $D$ onto the set of matrices whose columns have $l_2$ norm at most one. This prevents the dictionary columns from growing out of bounds.

function dictionary_learning(
        $x_1,\ldots,x_N$, // signals
        $\eta$, $\lambda$) // parameters
    initialize $D^{(0)}$
    for $t = 0,1,\ldots$
        select $x_i$ randomly
        $\alpha$ := soft_thresholding($x_i$, $D^{(t)}$, $\eta$, $\lambda$) // alternatively, a separate $\lambda$ can be used
        $D^{(t+1)} := D^{(t)} - \eta (D^{(t)} \alpha - x_i) \alpha^T$
        for $i = 1, \ldots, n$
            for $j = 1,\ldots, m$
                $d_{i,j}^{(t + 1)} := \frac{d_{i,j}^{(t + 1)}}{\max(\|d_j^{(t + 1)}\|_2, 1)}$ // $d_j^{(t + 1)}$ is column $j$

Algorithm 3: Stochastic gradient descent for dictionary learning.
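Algorithm 3 can be sketched as follows; this is my own implementation, with the soft-thresholding sparse coder inlined and all hyper-parameters (step size, $\lambda$, iteration counts, dictionary size) assumed rather than taken from the monograph:

```python
import numpy as np

def soft_threshold_code(x, D, eta, lam, iterations=50):
    """Sparse code x against D via gradient steps and soft thresholding."""
    alpha = np.zeros(D.shape[1])
    for _ in range(iterations):
        alpha = alpha + eta * D.T @ (x - D @ alpha)
        alpha = np.sign(alpha) * np.maximum(np.abs(alpha) - lam, 0.0)
    return alpha

def dictionary_learning(X, m, eta=0.05, lam=0.1, iterations=500, seed=0):
    """Stochastic gradient descent on the dictionary; X holds signals as rows."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    D = rng.standard_normal((n, m))
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # project columns
    for _ in range(iterations):
        x = X[rng.integers(N)]                       # pick a random signal
        alpha = soft_threshold_code(x, D, eta, lam)
        # Gradient step on 1/2 ||x - D alpha||_2^2 with respect to D.
        D = D - eta * np.outer(D @ alpha - x, alpha)
        # Project each column onto the l2 unit ball.
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D
```

The per-column projection is exactly the last step of Algorithm 3: columns with norm below one are left untouched, longer columns are rescaled to unit length.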

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.