"Understanding deep image representations by inverting them", Mahendran and Vedaldi • David Stutz

SEPTEMBER 2017

READING

Aravindh Mahendran, Andrea Vedaldi. Understanding deep image representations by inverting them. CVPR, 2015.

Mahendran and Vedaldi propose a technique for visualizing the higher-level features encoded in deep image representations. The idea is to compute, with the help of a suitable image prior, a reconstruction whose representation matches the given one as closely as possible. The approach is applied to AlexNet [] as well as convolutional neural networks mimicking DSIFT [][] and HoG [].

The underlying optimization problem takes the form

$x^\ast = \arg\min_{x\in \mathbb{R}^{H \times W \times C}} l(\Phi(x), \Phi_0) + \lambda \mathcal{R}(x)$

where $\Phi(x)$ is the representation of image $x$ and $\Phi_0 = \Phi(x_0)$ is the representation to be visualized. The loss $l$ is the Euclidean distance; as regularizer $\mathcal{R}$, Mahendran and Vedaldi combine the $\alpha$-norm

$\mathcal{R}_\alpha (x) = \|x\|_\alpha^\alpha$

and total variation (in its discrete form):

$\mathcal{R}_{V^\beta}(x) = \sum_{i,j}\left((x_{i,j + 1} - x_{i,j})^2 + (x_{i + 1,j} - x_{i,j})^2\right)^{\frac{\beta}{2}}$
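As a minimal sketch, the two regularizers can be written in a few lines of NumPy. The function names, the `H x W x C` array layout, the per-channel treatment of the total variation, and the default values of `alpha` and `beta` are assumptions made for illustration, not taken from the paper's code.

```python
import numpy as np


def alpha_norm(x, alpha=6.0):
    """R_alpha(x) = ||x||_alpha^alpha, i.e. the sum of |x|^alpha over all entries."""
    return float(np.sum(np.abs(x) ** alpha))


def total_variation(x, beta=2.0):
    """Discrete R_{V^beta}(x) for an H x W x C array.

    Assumption: the sum runs over pixels where both finite differences
    exist, and each channel is treated independently.
    """
    dh = x[:-1, 1:] - x[:-1, :-1]  # x[i, j + 1] - x[i, j]
    dv = x[1:, :-1] - x[:-1, :-1]  # x[i + 1, j] - x[i, j]
    return float(np.sum((dh ** 2 + dv ** 2) ** (beta / 2.0)))
```

A constant image has zero total variation, while a single bright pixel is penalized by both terms.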

Balancing these three terms requires some care. First, the Euclidean distance is normalized by $\|\Phi_0\|_2^2$. Second, $\Phi(x)$ is replaced by a scaled version $\Phi(\sigma x)$ because the first convolutional layers are not completely insensitive to the scaling of the input; $\sigma$ is set to the average Euclidean norm of the images. The final objective takes the form

$\|\Phi(\sigma x) - \Phi_0\|_2^2/\|\Phi_0\|_2^2 + \lambda_\alpha \mathcal{R}_\alpha(x) + \lambda_{V^\beta}\mathcal{R}_{V^\beta}(x)$.

The weighting parameters $\lambda_\alpha$ and $\lambda_{V^\beta}$ are chosen as detailed in the paper, and the objective is minimized using gradient descent with momentum.
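To make the optimization concrete, here is a small sketch of the final objective and its minimization with momentum. It is not the authors' implementation: a toy representation $\Phi(x) = \tanh(Wx)$ with a random matrix `W` stands in for a real network so that the gradient can be written by hand, $\beta$ is fixed to 2, and the step size, momentum, weights and all names are hypothetical choices.

```python
import numpy as np


def objective_and_grad(x, W, phi0, sigma, lam_alpha, lam_tv, alpha=6.0):
    """Value and gradient of the inversion objective for a toy
    representation Phi(x) = tanh(W @ x.ravel()) (assumption), beta = 2."""
    # Normalized Euclidean loss ||Phi(sigma x) - Phi_0||^2 / ||Phi_0||^2.
    t = np.tanh(W @ (sigma * x.ravel()))
    resid = t - phi0
    norm = np.sum(phi0 ** 2)
    loss = np.sum(resid ** 2) / norm
    g_loss = (2.0 * sigma / norm) * (W.T @ (resid * (1.0 - t ** 2)))
    g_loss = g_loss.reshape(x.shape)

    # alpha-norm regularizer and its gradient.
    r_alpha = np.sum(np.abs(x) ** alpha)
    g_alpha = alpha * np.sign(x) * np.abs(x) ** (alpha - 1.0)

    # Total variation with beta = 2 and its gradient.
    dh = x[:-1, 1:] - x[:-1, :-1]
    dv = x[1:, :-1] - x[:-1, :-1]
    r_tv = np.sum(dh ** 2 + dv ** 2)
    g_tv = np.zeros_like(x)
    g_tv[:-1, 1:] += 2.0 * dh
    g_tv[1:, :-1] += 2.0 * dv
    g_tv[:-1, :-1] -= 2.0 * (dh + dv)

    value = loss + lam_alpha * r_alpha + lam_tv * r_tv
    grad = g_loss + lam_alpha * g_alpha + lam_tv * g_tv
    return float(value), grad


def invert(W, phi0, shape, sigma=1.0, lam_alpha=1e-3, lam_tv=1e-2,
           lr=0.02, momentum=0.9, steps=500, seed=0):
    """Gradient descent with momentum, starting from small random noise."""
    rng = np.random.default_rng(seed)
    x = 0.1 * rng.standard_normal(shape)
    v = np.zeros_like(x)
    history = []
    for _ in range(steps):
        value, grad = objective_and_grad(x, W, phi0, sigma, lam_alpha, lam_tv)
        history.append(value)
        v = momentum * v - lr * grad  # velocity update
        x = x + v                     # parameter update
    return x, history


# Toy usage: invert the representation of a random 4 x 4 x 1 "image".
rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=(4, 4, 1))
W = rng.standard_normal((8, x0.size)) / 4.0
phi0 = np.tanh(W @ x0.ravel())
x_star, history = invert(W, phi0, x0.shape)
```

With a real network, the hand-written gradient of the loss term would be replaced by backpropagation through $\Phi$; the regularizer gradients and the momentum update would stay the same.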

In the experiments, they consider AlexNet [] as well as convolutional neural networks reconstructing DSIFT [17,20] and HoG [4]. In particular, they detail how DSIFT and HoG can be expressed as convolutional neural networks by converting the individual operations to commonly used layers. Details can be found in the paper.
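To give a rough intuition for how a hand-crafted descriptor maps onto network layers, the sketch below expresses a soft orientation binning of image gradients as a linear filtering step followed by a ReLU. This is a simplified illustration under my own assumptions (finite-difference gradients, projection onto `num_orientations` unit directions); it is not the exact construction from the paper, which also covers pooling into cells and normalization.

```python
import numpy as np


def oriented_gradient_channels(img, num_orientations=4):
    """Soft orientation binning for a 2D image (hypothetical helper).

    Linear step: finite-difference gradients projected onto unit
    directions. Non-linear step: ReLU keeps only aligned gradients.
    Returns an array of shape (H - 1, W - 1, num_orientations).
    """
    gx = img[:-1, 1:] - img[:-1, :-1]  # horizontal difference
    gy = img[1:, :-1] - img[:-1, :-1]  # vertical difference
    thetas = 2.0 * np.pi * np.arange(num_orientations) / num_orientations
    proj = gx[..., None] * np.cos(thetas) + gy[..., None] * np.sin(thetas)
    return np.maximum(proj, 0.0)
```

For an image that increases from left to right, only the channel for the horizontal direction responds.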

Qualitative results in Figure 1 show the reconstruction of an input image for representations obtained from the different layers in AlexNet. Note that for each layer, the weighting parameters are chosen separately.

Figure 1: Reconstructions from each layer in AlexNet.

[] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[] E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image classification. In ECCV, 2006.
[] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
