F. Perronnin, C. R. Dance. Fisher kernels on visual vocabularies for image categorization. In Computer Vision and Pattern Recognition, Conference on, pages 1–8, Minneapolis, Minnesota, June 2007.

Perronnin and Dance use Fisher vectors [1] for image categorization. The approach is also widely used for image retrieval, for example in [2]. Fisher vectors are easily motivated by considering a Gaussian mixture model for the $l$-th local descriptor $y_{l,n}$ extracted from image $n$:

$p(y_{l,n}) = \sum_{m = 1}^M w_m \mathcal{N}(y_{l,n}|\mu_m, \Sigma_m)$,    $\sum_{m = 1}^M w_m = 1$

where $\mathcal{N}(y_{l,n}|\mu_m, \Sigma_m)$ denotes a Gaussian with mean $\mu_m$ and covariance $\Sigma_m$. The model is learned on $Y = \bigcup_{n = 1}^N Y_n$, the set of all local descriptors extracted from the images $n = 1,\ldots,N$, using the Expectation Maximization algorithm. The idea of Fisher vectors is to characterize a local descriptor $y_{l,n}$ by the following gradient:

$\nabla_{\mu_m} \log(p(y_{l,n}))$.

Intuitively, this characterizes each descriptor by the direction in which the mean $\mu_m$ should be moved for the model to better fit the descriptor. Taking into account all $L$ local descriptors $Y_n$ of image $n$, which are assumed to be independent, the log-likelihood can be written as

$\log(p(Y_n)) = \sum_{l = 1}^L \log(p(y_{l,n}))$.
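As a minimal sketch of the two formulas so far, the following evaluates $\log(p(Y_n))$ for a toy mixture. It assumes diagonal covariances (the case considered further below) and uses made-up parameters; in practice the weights, means and variances would come from EM. The function names are illustrative only.

```python
import math

def log_gaussian_diag(y, mu, var):
    """Log density at y of a Gaussian with mean mu and covariance diag(var)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v
        for yi, m, v in zip(y, mu, var)
    )

def log_mixture(y, weights, means, variances):
    """log p(y) of the mixture, using log-sum-exp for numerical stability."""
    terms = [
        math.log(w) + log_gaussian_diag(y, mu, var)
        for w, mu, var in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def image_log_likelihood(descriptors, weights, means, variances):
    """log p(Y_n): sum of log p(y) over the (independent) descriptors."""
    return sum(log_mixture(y, weights, means, variances) for y in descriptors)

# Toy model with assumed values: M = 2 components, c = 2 dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
descriptors = [[0.1, -0.2], [2.9, 3.1], [1.5, 1.5]]
print(image_log_likelihood(descriptors, weights, means, variances))
```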

The partial derivative of the log-likelihood with respect to the mean $\mu_m$ is given as

$\sum_{l = 1}^L \gamma_m(y_{l,n}) \Sigma_m^{-1}(y_{l,n} - \mu_m)$,    $\gamma_m(y_{l,n}) = \frac{w_m \mathcal{N}(y_{l,n}|\mu_m,\Sigma_m)}{\sum_{m' = 1}^M w_{m'} \mathcal{N}(y_{l,n}|\mu_{m'},\Sigma_{m'})}$.
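The gradient above can be sketched in a few lines of plain Python. This is an illustration under the assumption of diagonal covariances, so that $\Sigma_m^{-1}$ reduces to an element-wise division by the variances; the helper names are hypothetical.

```python
import math

def responsibilities(y, weights, means, variances):
    """Posterior gamma_m(y) of every mixture component for descriptor y."""
    logs = []
    for w, mu, var in zip(weights, means, variances):
        log_n = -0.5 * sum(
            math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v
            for yi, m, v in zip(y, mu, var)
        )
        logs.append(math.log(w) + log_n)
    # Normalize in the log domain to avoid underflow.
    peak = max(logs)
    unnorm = [math.exp(t - peak) for t in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def mean_gradient(descriptors, m, weights, means, variances):
    """Gradient of log p(Y_n) with respect to mu_m (diagonal covariance)."""
    grad = [0.0] * len(means[m])
    for y in descriptors:
        gamma = responsibilities(y, weights, means, variances)[m]
        for d, (yd, mu_d, var_d) in enumerate(zip(y, means[m], variances[m])):
            grad[d] += gamma * (yd - mu_d) / var_d
    return grad
```

A descriptor lying exactly between two equally weighted components is assigned to both with responsibility 0.5, so it pulls each mean towards itself with equal strength.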

In practice, the covariance $\Sigma_m$ is assumed to be diagonal, that is $\Sigma_m = \text{diag}(\sigma_{1,m}^2,\ldots,\sigma_{c,m}^2)$ where $c$ is the dimensionality of the descriptors. Further, the gradient vectors are normalized using the Fisher information matrix

$Z = \mathbb{E}_{Y_n}[\nabla \log(p(Y_n))\nabla \log(p(Y_n))^T]$

for which Perronnin and Dance derive the following approximation:

$Z_{\mu_m}^{-1}\nabla_{\mu_m} \log(p(y_{l,n}))$ with $Z_{\mu_m} = \frac{Lw_m}{\sigma_m^2}$

where $\sigma_m^2 = (\sigma_{1,m}^2,\ldots,\sigma_{c,m}^2)$ is the vector of variances and both the inversion and the division are meant element-wise. Based on the above derivation, each descriptor is embedded as

$f(y_{l,n}) = \left(Z_{\mu_1}^{-1}\nabla_{\mu_1} \log(p(y_{l,n})),\ldots,Z_{\mu_M}^{-1}\nabla_{\mu_M} \log(p(y_{l,n}))\right)$.
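The embedding of a single descriptor might be sketched as follows. The code takes the formulas exactly as stated above: diagonal covariances, and $Z_{\mu_m} = L w_m / \sigma_m^2$ applied element-wise. The function name and the parameter `num_descriptors` (playing the role of $L$) are illustrative assumptions, not part of any library.

```python
import math

def embed_descriptor(y, num_descriptors, weights, means, variances):
    """Return f(y): the M normalized mean-gradients, concatenated (length M * c)."""
    # Responsibilities gamma_m(y), normalized in the log domain.
    logs = []
    for w, mu, var in zip(weights, means, variances):
        log_n = -0.5 * sum(
            math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v
            for yi, m, v in zip(y, mu, var)
        )
        logs.append(math.log(w) + log_n)
    peak = max(logs)
    unnorm = [math.exp(t - peak) for t in logs]
    total = sum(unnorm)
    gammas = [u / total for u in unnorm]

    embedding = []
    for gamma, w, mu, var in zip(gammas, weights, means, variances):
        for yd, mu_d, var_d in zip(y, mu, var):
            grad = gamma * (yd - mu_d) / var_d   # gradient of log p(y) w.r.t. mu_m
            z = num_descriptors * w / var_d      # Z_{mu_m}, element-wise
            embedding.append(grad / z)
    return embedding
```

The resulting vector has $M \cdot c$ entries, one block of $c$ values per mixture component.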

These vectors are then aggregated by summing over all $f(y_{l,n})$ for a specific image $n$. The result is usually power-law normalized.
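The aggregation step can be sketched as below. The text does not fix the form of the power-law normalization; the signed variant $\text{sign}(z)|z|^\alpha$ with $\alpha = 0.5$ used here is a common choice and should be read as an assumption. The per-descriptor embeddings are passed in as plain lists.

```python
import math

def aggregate(embeddings):
    """Sum the per-descriptor embeddings f(y_l) of one image, entry by entry."""
    total = [0.0] * len(embeddings[0])
    for f in embeddings:
        for i, value in enumerate(f):
            total[i] += value
    return total

def power_normalize(vector, alpha=0.5):
    """Signed power-law normalization: sign(z) * |z|^alpha (alpha is assumed)."""
    return [math.copysign(abs(z) ** alpha, z) for z in vector]

# Two toy embeddings for one image (made-up values).
image_vector = power_normalize(aggregate([[1.0, -2.0], [3.0, -2.0]]))
print(image_vector)
```

Keeping the sign matters because the entries are gradients and can be negative; the exponent only dampens large magnitudes.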

  • [1] T. S. Jaakkola, D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, pages 487–493, Denver, Colorado, November 1999.
  • [2] A. Gordo, J. A. Rodríguez-Serrano, F. Perronnin, E. Valveny. Leveraging category-level labels for instance-level image retrieval. In Computer Vision and Pattern Recognition, Conference on, pages 3045–3052, Providence, Rhode Island, June 2012.