H. Jégou, M. Douze, C. Schmid, P. Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition, Conference on, pages 3304–3311, San Francisco, California, June 2010.

Similar to the Bag of Visual Words model [1], Jégou et al. approach the problem of image retrieval by first computing a vocabulary of $M$ visual words learned using $k$-means clustering. Instead of counting word occurrences, they consider the residual between each feature and its nearest visual word:

$f(y_{l,n}) = (\delta(NN_{\hat{Y}}(y_{l,n}) = \hat{y}_1)(y_{l,n} - \hat{y}_1), \ldots, \delta(NN_{\hat{Y}}(y_{l,n}) = \hat{y}_M)(y_{l,n} - \hat{y}_M))$

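where $\hat{Y} = \{\hat{y}_1, \ldots, \hat{y}_M\}$ is the set of learned visual words, $y_{l,n}$ is the $l$-th feature extracted from image $n$, $NN_{\hat{Y}}(y_{l,n})$ denotes the visual word nearest to $y_{l,n}$, and $\delta(\cdot)$ is $1$ if its argument holds and $0$ otherwise. Thus, only the block corresponding to the nearest visual word is non-zero. These so-called embeddings are then aggregated over the $L$ features of image $n$ using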

$F(Y_n) = \sum_{l = 1}^L f(y_{l,n})$.

The image representation $F(Y_n)$ has dimension $M \cdot D$ for $D$-dimensional features and is usually $L_2$-normalized.
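
The embedding and aggregation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `vlad`, the array layout and the brute-force nearest-neighbor search are assumptions made for this example, and the visual words are assumed to be given (for example from a prior $k$-means run).

```python
import numpy as np


def vlad(features, words):
    """Aggregate residuals to the nearest visual word (illustrative sketch).

    features: array of shape (L, D), the features y_{l,n} of one image.
    words:    array of shape (M, D), the visual words \\hat{y}_1, ..., \\hat{y}_M.
    Returns the L2-normalized representation F(Y_n) of shape (M * D,).
    """
    features = np.asarray(features, dtype=float)
    words = np.asarray(words, dtype=float)

    # Squared Euclidean distances between all features and all words: (L, M).
    distances = ((features[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
    # Index of the nearest visual word for every feature.
    nearest = distances.argmin(axis=1)

    # Sum the residuals y - \hat{y}_m per word; np.add.at handles the
    # case where several features share the same nearest word.
    aggregated = np.zeros_like(words)
    np.add.at(aggregated, nearest, features - words[nearest])

    # Concatenate the M blocks and L2-normalize (skip all-zero vectors).
    aggregated = aggregated.ravel()
    norm = np.linalg.norm(aggregated)
    return aggregated / norm if norm > 0 else aggregated


# Toy example (made-up data): two visual words and three features in 2D.
example_words = np.array([[0.0, 0.0], [10.0, 10.0]])
example_features = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]])
example_vlad = vlad(example_features, example_words)
```

In the toy example, the first two features fall into the first word and the third into the second, so the unnormalized blocks are $(1, 1)$ and $(-1, 0)$ before normalization.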

  • [1] J. Sivic, A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Computer Vision, International Conference on, pages 1470–1477, Nice, France, October 2003.