
Gernot Riegler, Matthias Rüther, Horst Bischof. ATGV-Net: Accurate Depth Super-Resolution. ECCV, 2016.

Riegler et al. discuss a bi-level optimization problem for depth super-resolution where the lower-level problem is unrolled, i.e., its operations are expressed as network layers, such that the problem can be solved end-to-end. Given a training set $\{(s_k, t_k)\}_{k=1}^K$ of bilinearly upsampled depth images $s_k$ and high-resolution ground truth depth images $t_k$, the task is to predict the residual between $s_k$ and $t_k$. To this end, the following bi-level optimization problem is considered:

$\min_w \frac{1}{K} \sum_{k = 1}^K L(u^*(f(w, s_k)), t_k)$

s.t. $u^*(f(w, s_k)) = \arg\min_u E(u; f(w, s_k))$.

where $L$ denotes the Euclidean loss; the upper-level problem therefore represents the Euclidean loss on the training set. The lower-level problem is characterized by the error $E$, which Riegler et al. define as

$E(u;f(w,s_k)) = R(u, h(w_h, s_k)) + \frac{e^{w_\lambda}}{2} \|u - g(w_g, s_k)\|_2^2$

$R(u, h(w_h, s_k)) = \min_v \alpha_1 \|T(h(w_h, s_k))(\nabla u - v)\|_1 + \alpha_0 \|\nabla v\|_1$

where the latter is motivated by the total generalized variation in order to favor piece-wise affine solutions. Overall, $f(w, s_k) = [h(w_h, s_k), w_\lambda, g(w_g, s_k)]$ is parameterized by two deep networks $h$ and $g$ and the weighting parameter $w_\lambda$. Further, $T$ is an anisotropic diffusion tensor based on the Nagel-Enkelmann operator []:

$T(h(w_h, s_k)) = \exp(-\beta \|h(w_h, s_k)\|_2^\gamma) nn^T + n_\bot n_\bot^T$

where $n$ is the normalized gradient as estimated by $h$, and $\beta$ and $\gamma$ are weighting parameters.
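As a small illustration (a sketch, not the authors' code), the tensor can be assembled per pixel from a two-channel gradient estimate, assuming that $n$ is obtained by normalizing the output of $h$; $\beta$, $\gamma$, and the input are placeholders:

```python
import numpy as np

def diffusion_tensor(h, beta=1.0, gamma=0.8, eps=1e-8):
    """Per-pixel anisotropic diffusion tensor T from a gradient estimate h of
    shape (H, W, 2), following the formula above."""
    norm = np.linalg.norm(h, axis=2, keepdims=True)       # |h| per pixel, (H, W, 1)
    n = h / (norm + eps)                                   # unit direction n
    n_perp = np.stack([-n[..., 1], n[..., 0]], axis=2)     # n rotated by 90 degrees
    weight = np.exp(-beta * norm[..., 0] ** gamma)         # exp(-beta * |h|^gamma)
    # per-pixel outer products n n^T and n_perp n_perp^T, each of shape (H, W, 2, 2)
    nnT = n[..., :, None] * n[..., None, :]
    ppT = n_perp[..., :, None] * n_perp[..., None, :]
    return weight[..., None, None] * nnT + ppT

h = np.random.randn(16, 16, 2)   # stand-in for the network output h(w_h, s_k)
T = diffusion_tensor(h)          # shape (16, 16, 2, 2)
```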

Due to the $L_1$ norms used, the problem is non-smooth. To approach the minimization of $E$, Riegler et al. use the primal-dual algorithm proposed by Chambolle and Pock in [] (see here for reading notes). Briefly summarized, the algorithm addresses saddle-point problems of the form

$\min_x \max_y \langle Kx, y\rangle + G(x) - F^*(y)$

which can be seen as the primal-dual version of

$\min_x F(Kx) + G(x)$

They further derive a first-order algorithm based on the following update equations:

$y^{n + 1} = (I + \sigma \partial F^*)^{-1}(y^n + \sigma K \bar{x}^n)$

$x^{n + 1} = (I + \tau \partial G)^{-1}(x^n - \tau K^* y^{n + 1})$

$\bar{x}^{n + 1} = x^{n + 1} + \theta(x^{n + 1} - x^n)$

with $\partial F^*$ denoting the subdifferential of $F^*$ (and analogously $\partial G$ for $G$); the resolvents $(I + \sigma \partial F^*)^{-1}$ and $(I + \tau \partial G)^{-1}$ are the corresponding proximal operators.
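To make the update equations concrete, the following sketch applies them to plain TV-$L_2$ (ROF) denoising rather than the anisotropic TGV model of the paper: with $K = \nabla$, $F = \lambda \|\cdot\|_1$, and $G(x) = \frac{1}{2}\|x - s\|_2^2$, both resolvents have closed forms, a point-wise projection for the dual variable and an averaging with the input for the primal variable. Step sizes, $\lambda$, and the iteration count are arbitrary placeholders:

```python
import numpy as np

def grad(u):
    # forward differences (Neumann boundary), playing the role of K = nabla
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return np.stack([gx, gy])

def div(p):
    # negative adjoint of grad, i.e. K^* y = -div(y)
    px, py = p
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]; dx[:, 1:-1] = px[:, 1:-1] - px[:, :-2]; dx[:, -1] = -px[:, -2]
    dy[0, :] = py[0, :]; dy[1:-1, :] = py[1:-1, :] - py[:-2, :]; dy[-1, :] = -py[-2, :]
    return dx + dy

def tv_l2_primal_dual(s, lam=0.1, iterations=200):
    """TV-L2 denoising via the update equations above."""
    tau = sigma = 0.99 / np.sqrt(8.0)          # tau * sigma * ||K||^2 < 1
    x = s.copy(); x_bar = s.copy()
    y = np.zeros((2,) + s.shape)
    for _ in range(iterations):
        # dual step: resolvent of F^* is a point-wise projection onto {|y| <= lam}
        y = y + sigma * grad(x_bar)
        y /= np.maximum(1.0, np.sqrt((y ** 2).sum(axis=0, keepdims=True)) / lam)
        # primal step: resolvent of G averages with the noisy input s
        x_new = (x + tau * div(y) + tau * s) / (1.0 + tau)
        # over-relaxation with theta = 1
        x_bar = 2.0 * x_new - x
        x = x_new
    return x

denoised = tv_l2_primal_dual(np.random.rand(64, 64))
```

The same structure carries over to the anisotropic TGV energy, only with additional primal and dual variables and the diffusion tensor entering the linear operator $K$.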

Riegler et al. first reformulate $E$ in the necessary saddle-point form. To this end, it is useful to know the conjugate of the $L_1$ norm (see for example these lecture notes by Tibshirani). Applied to $R$, this results in the overall problem

$\min_{u, v} \max_{p, q} \langle T(h(w_h, s_k))(\nabla u - v), p\rangle + \langle \nabla v, q\rangle + \frac{e^{w_\lambda}}{2} \|u - g(w_g, s_k)\|_2^2$

s.t. $\|p\|_\infty \leq \alpha_1, \quad \|q\|_\infty \leq \alpha_0$

where both the scalar products and the constraints come from substituting the conjugate of the $L_1$ norm. Applying the update equations results in projected gradient ascent steps in the dual variables $p$ and $q$, using the projection $\text{proj}(p) = \frac{p}{\max(1, \|p\|_2)}$, and proximal gradient descent steps in the primal variables $u$ and $v$. However, to understand the general idea of the approach, the exact derivation of the update equations is secondary. More important is that these steps can be unrolled with the idea of a recurrent neural network in mind. To this end, Riegler et al. discuss how the individual operations can be expressed as common neural network layers. When expressing the derivatives as finite differences, all operations are either compositions of point-wise operations, or can be expressed as convolutions, e.g., with Sobel filters to compute the derivatives.
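As a minimal sketch of this idea (not the authors' implementation), the spatial derivatives can be realized as convolutions with small fixed filters, here simple forward differences instead of Sobel filters, so that each operation of an unrolled step is an ordinary, differentiable layer:

```python
import torch
import torch.nn.functional as F

# fixed forward-difference filters; each produces one derivative direction
dx = torch.tensor([[[[-1.0, 1.0]]]])           # shape (1, 1, 1, 2)
dy = torch.tensor([[[[-1.0], [1.0]]]])         # shape (1, 1, 2, 1)

def gradient_as_conv(u):
    """Spatial gradient of u (shape (N, 1, H, W)) as convolutions with fixed
    filters, so the operation is an ordinary layer in the unrolled network."""
    gx = F.conv2d(F.pad(u, (0, 1, 0, 0), mode='replicate'), dx)  # keep width W
    gy = F.conv2d(F.pad(u, (0, 0, 0, 1), mode='replicate'), dy)  # keep height H
    return torch.cat([gx, gy], dim=1)

u = torch.rand(1, 1, 8, 8, requires_grad=True)
grads = gradient_as_conv(u)      # shape (1, 2, 8, 8)
grads.sum().backward()           # gradients flow back through the unrolled op
```

Since all quantities stay inside the automatic differentiation graph, the unrolled steps can be stacked and trained jointly with $g$ and $h$.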

The neural network $g$ directly models an initial estimate of the high-resolution depth map by predicting the residual as described above. The neural network $h$ is used to weight the pairwise regularization term; therefore, $h$ is trained to model the gradient of the high-resolution target. A combined Euclidean loss is used for training:

$L_p(\{s_k, t_k\}_{k = 1}^K) = \frac{1}{K} \sum_{k = 1}^K \left( \|g(w_g, s_k) - t_k\|_2^2 + \|h(w_h, s_k) - \nabla t_k\|_2^2 \right)$
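To make the loss concrete, here is a minimal sketch with two tiny stand-in CNNs (the paper's actual architectures for $g$ and $h$ are deeper) and the combined Euclidean loss from above; the target gradient is computed with simple forward differences, and all shapes and data are placeholders:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Tiny stand-in for the networks g and h."""
    def __init__(self, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, out_channels, 3, padding=1))
    def forward(self, s):
        return self.net(s)

g = SmallCNN(out_channels=1)   # predicts the high-resolution depth (via a residual in the paper)
h = SmallCNN(out_channels=2)   # predicts the gradient of the high-resolution target

def target_gradient(t):
    # forward differences of the ground truth, zero at the right/bottom border
    gx = torch.zeros_like(t); gy = torch.zeros_like(t)
    gx[..., :, :-1] = t[..., :, 1:] - t[..., :, :-1]
    gy[..., :-1, :] = t[..., 1:, :] - t[..., :-1, :]
    return torch.cat([gx, gy], dim=1)

def combined_loss(s, t):
    # L_p: Euclidean loss on the depth prediction plus on the gradient prediction
    return ((g(s) - t) ** 2).sum() + ((h(s) - target_gradient(t)) ** 2).sum()

s = torch.rand(4, 1, 32, 32)   # bilinearly upsampled inputs (placeholder)
t = torch.rand(4, 1, 32, 32)   # high-resolution targets (placeholder)
loss = combined_loss(s, t) / s.shape[0]
loss.backward()
```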

First, $g$ and $h$ are initialized by several iterations of gradient descent with momentum that directly minimize this objective. Second, the unrolled optimization steps of the first-order primal-dual algorithm are stacked on top and the full network is fine-tuned on the same objective. Training is done partly on synthetic data; the presented results look promising, although the difference between ATGV-Net and a generic CNN for depth super-resolution is hard to observe. The results are shown in Figure 1.
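A compressed sketch of this two-stage schedule, reusing `g`, `h`, `combined_loss`, `s`, and `t` from the previous snippet; the optimizer settings and iteration count are placeholders, and `unrolled_atgv` is a hypothetical module standing in for the stacked primal-dual iterations:

```python
import torch

# Stage 1: initialize g and h by gradient descent with momentum on L_p
# (reuses g, h, combined_loss, s, t from the previous sketch).
optimizer = torch.optim.SGD(
    list(g.parameters()) + list(h.parameters()), lr=1e-3, momentum=0.9)
for _ in range(1000):                       # placeholder iteration count
    optimizer.zero_grad()
    loss = combined_loss(s, t) / s.shape[0]
    loss.backward()
    optimizer.step()

# Stage 2: stack the unrolled primal-dual iterations on top of g and h and
# fine-tune the full network end-to-end on the same objective;
# `unrolled_atgv` is a hypothetical module implementing those steps as layers.
# refined = unrolled_atgv(g(s), h(s))
# loss = ((refined - t) ** 2).sum() / s.shape[0]
```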


Figure 1: Results of depth super-resolution using the proposed ATGV-Net.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.