27th November 2015

F. Perronnin, C. R. Dance. *Fisher kernels on visual vocabularies for image categorization*. In Computer Vision and Pattern Recognition, Conference on, pages 1–8, Minneapolis, Minnesota, June 2007.


Perronnin and Dance use Fisher vectors [1] for image categorization. However, the approach is also often used for image retrieval, for example in [2]. Fisher vectors are easily motivated when considering a Gaussian mixture model for the extracted descriptor $y_{l,n}$ in image $n$:

$p(y_{l,n}) = \sum_{m = 1}^M w_m \mathcal{N}(y_{l,n}|\mu_m, \Sigma_m)$, $\sum_{m = 1}^M w_m = 1$

where $\mathcal{N}(y_{l,n}|\mu_m, \Sigma_m)$ denotes a Gaussian with mean $\mu_m$ and covariance $\Sigma_m$. The model is learned on $Y = \bigcup_{n = 1}^N Y_n$, the set of all local descriptors extracted from the images $n = 1,\ldots,N$, using the Expectation Maximization algorithm. The idea of Fisher vectors is to characterize a local descriptor $y_{l,n}$ by the following gradient:

$\nabla_{\mu_m} \log(p(y_{l,n}))$.

Intuitively, this gradient describes the direction in which the mean $\mu_m$ should be moved so that the model better fits the descriptor. Taking into account all $L$ local descriptors $Y_n = \{y_{1,n},\ldots,y_{L,n}\}$ of image $n$, which are assumed to be independent, the log-likelihood can be written as

$\log(p(Y_n)) = \sum_{l = 1}^L \log(p(y_{l,n}))$.
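As a minimal sketch of the mixture density and the log-likelihood above, the following NumPy code evaluates $\log(p(Y_n))$ for the descriptors of one image. The function names are made up for illustration, and the code assumes diagonal covariances (`variances` holds the diagonal of each $\Sigma_m$), which is a simplification rather than something the formulas above require. The sum over components is computed with a log-sum-exp for numerical stability.

```python
import numpy as np


def log_gaussians(Y, means, variances):
    """Log-density of every descriptor under every diagonal Gaussian.

    Y: (L, c) descriptors, means: (M, c), variances: (M, c).
    Returns an (L, M) array holding log N(y_l | mu_m, Sigma_m).
    """
    c = Y.shape[1]
    diff = Y[:, None, :] - means[None, :, :]                  # (L, M, c)
    mahalanobis = np.sum(diff ** 2 / variances[None], axis=2)  # (L, M)
    log_det = np.sum(np.log(variances), axis=1)                # (M,)
    return -0.5 * (mahalanobis + log_det[None, :] + c * np.log(2.0 * np.pi))


def image_log_likelihood(Y, weights, means, variances):
    """log p(Y_n) = sum_l log sum_m w_m N(y_l | mu_m, Sigma_m)."""
    log_joint = np.log(weights)[None, :] + log_gaussians(Y, means, variances)
    # Log-sum-exp over the mixture components, one value per descriptor.
    peak = log_joint.max(axis=1, keepdims=True)
    log_p = peak[:, 0] + np.log(np.sum(np.exp(log_joint - peak), axis=1))
    # Descriptors are assumed independent, so their log-densities add up.
    return float(np.sum(log_p))
```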

The partial derivative of the log-likelihood with respect to the mean $\mu_m$ is given as

$\sum_{l = 1}^L \gamma_m(y_{l,n}) \Sigma_m^{-1}(y_{l,n} - \mu_m)$ with $\gamma_m(y_{l,n}) = \frac{w_m \mathcal{N}(y_{l,n}|\mu_m,\Sigma_m)}{\sum_{m' = 1}^M w_{m'} \mathcal{N}(y_{l,n}|\mu_{m'},\Sigma_{m'})}$.
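The gradient above can be sketched in a few lines of NumPy. This is only an illustration: the function names are hypothetical, and the code assumes diagonal covariances stored as an `(M, c)` array `variances`, so that multiplying by $\Sigma_m^{-1}$ becomes an element-wise division.

```python
import numpy as np


def responsibilities(Y, weights, means, variances):
    """Posteriors gamma_m(y_l) for all descriptors, shape (L, M)."""
    diff = Y[:, None, :] - means[None, :, :]                    # (L, M, c)
    log_gauss = -0.5 * np.sum(
        diff ** 2 / variances[None] + np.log(2.0 * np.pi * variances[None]),
        axis=2,
    )
    log_joint = np.log(weights)[None, :] + log_gauss            # (L, M)
    # Subtract the row-wise maximum before exponentiating (stability only;
    # the shift cancels in the normalization below).
    log_joint -= log_joint.max(axis=1, keepdims=True)
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)


def mean_gradients(Y, weights, means, variances):
    """Gradient of log p(Y_n) with respect to each mean mu_m, shape (M, c)."""
    gamma = responsibilities(Y, weights, means, variances)      # (L, M)
    diff = Y[:, None, :] - means[None, :, :]                    # (L, M, c)
    # sum_l gamma_m(y_l) * Sigma_m^{-1} (y_l - mu_m) for diagonal Sigma_m
    return np.sum(gamma[:, :, None] * diff / variances[None], axis=0)
```

With a single component the posteriors are all one, so the gradient reduces to $\sum_l (y_{l,n} - \mu_1) / \sigma_1^2$, which is an easy sanity check for the code.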

In practice, the covariance $\Sigma_m$ is assumed to be diagonal, that is, $\Sigma_m = \text{diag}(\sigma_{1,m}^2,\ldots,\sigma_{c,m}^2)$, where $c$ is the dimensionality of the descriptors. Further, the gradient vectors are normalized using the Fisher information matrix

$Z = \mathbb{E}_{Y_n}[\nabla \log(p(Y_n))\nabla \log(p(Y_n))^T]$

for which Perronnin and Dance derive the following approximation:

$Z_{\mu_m}^{-1}\nabla_{\mu_m} \log(p(y_{l,n}))$ with $Z_{\mu_m} = \frac{Lw_m}{\sigma_m^2}$

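where $\sigma_m^2 = (\sigma_{1,m}^2,\ldots,\sigma_{c,m}^2)$ is the vector of variances of component $m$, and the inversion as well as the division are meant element-wise. Based on the above derivation, each descriptor is embedded as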

$f(y_{l,n}) = \left(Z_{\mu_1}^{-1}\nabla_{\mu_1} \log(p(y_{l,n})),\ldots,Z_{\mu_M}^{-1}\nabla_{\mu_M} \log(p(y_{l,n}))\right)$.
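The embedding can be sketched as follows, taking the formulas exactly as written above (diagonal covariances, element-wise $Z_{\mu_m} = L w_m / \sigma_m^2$). The function name `fisher_embedding` and the array layout are assumptions made for this illustration, not the authors' implementation.

```python
import numpy as np


def fisher_embedding(Y, weights, means, variances):
    """Embed each descriptor y_l as f(y_l), returning an (L, M * c) array.

    Y: (L, c) descriptors of one image, weights: (M,),
    means: (M, c), variances: (M, c) diagonals of the covariances.
    """
    L = Y.shape[0]
    diff = Y[:, None, :] - means[None, :, :]                    # (L, M, c)

    # Posteriors gamma_m(y_l), computed in the log domain for stability.
    log_gauss = -0.5 * np.sum(
        diff ** 2 / variances[None] + np.log(2.0 * np.pi * variances[None]),
        axis=2,
    )
    log_joint = np.log(weights)[None, :] + log_gauss
    log_joint -= log_joint.max(axis=1, keepdims=True)
    gamma = np.exp(log_joint)
    gamma /= gamma.sum(axis=1, keepdims=True)                   # (L, M)

    # Per-descriptor gradient with respect to each mean.
    grad = gamma[:, :, None] * diff / variances[None]           # (L, M, c)

    # Element-wise normalizer Z_{mu_m} = L * w_m / sigma_m^2.
    Z = L * weights[:, None] / variances                        # (M, c)

    # Concatenate the M normalized blocks of size c for every descriptor.
    return (grad / Z[None]).reshape(L, -1)
```

Note that with this particular normalizer the variances cancel, so each block equals $\gamma_m(y_{l,n})(y_{l,n} - \mu_m) / (L w_m)$; the code keeps both steps separate to mirror the formulas.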

These vectors are then aggregated by summing over all $f(y_{l,n})$ for a specific image $n$. The result is usually power-law normalized.

[1] T. Jaakkola, D. Haussler. *Exploiting generative models in discriminative classifiers*. In Advances in Neural Information Processing Systems, pages 487–493, Denver, Colorado, November 1999.

[2] A. Gordo, J. A. Rodríguez-Serrano, F. Perronnin, E. Valveny. *Leveraging category-level labels for instance-level image retrieval*. In Computer Vision and Pattern Recognition, Conference on, pages 3045–3052, Providence, Rhode Island, June 2012.