J. Sivic, A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision (ICCV), pages 1470–1477, Nice, France, October 2003.

Sivic and Zisserman, motivated by early text-retrieval systems, present the so-called bag of visual words model for image retrieval. First, they cluster all local descriptors, denoted by $Y = \bigcup_{n = 1}^N Y_n$ where $Y_n$ are the local descriptors extracted from image $n$ (out of $N$ images), using $k$-means clustering to define a vocabulary of visual words. Subsequently, each descriptor is assigned to its nearest visual word, and the global image representation is a sparse vector of word counts. Let $\hat{Y} = \{\hat{y}_1,\ldots,\hat{y}_M\}$ be the $M$ extracted visual words. Then each extracted descriptor $y_{l,n} \in Y_n$ is represented by a one-hot vector (the so-called embedding):

$f(y_{l,n}) = \left(\delta(NN_{\hat{Y}}(y_{l,n}) = \hat{y}_1),\ldots,\delta(NN_{\hat{Y}}(y_{l,n}) = \hat{y}_M)\right)$

where $NN_{\hat{Y}}(y_{l,n})$ denotes the nearest neighbor of $y_{l,n}$ (the $l$-th of the $L$ descriptors extracted from image $n$) in $\hat{Y}$, and $\delta(\cdot)$ is the indicator function, which is $1$ if its argument is true and $0$ otherwise. Therefore, $f_m(y_{l,n}) = 1$ if and only if $NN_{\hat{Y}}(y_{l,n}) = \hat{y}_m$. These embeddings are then aggregated into a single vector of word counts:

$F(Y_n) = \sum_{l = 1}^L f(y_{l,n})$
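
To make the hard assignment and the count aggregation concrete, here is a minimal NumPy sketch. It assumes the vocabulary has already been computed (for example by $k$-means) and that nearest neighbors are found by Euclidean distance; the function names and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def assign_words(descriptors, vocabulary):
    """Index of the nearest visual word (Euclidean) for each descriptor."""
    # Pairwise squared distances, shape (L, M).
    diffs = descriptors[:, None, :] - vocabulary[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    return np.argmin(dists, axis=1)

def one_hot_embeddings(descriptors, vocabulary):
    """One-hot embedding f(y) for each descriptor, shape (L, M)."""
    idx = assign_words(descriptors, vocabulary)
    f = np.zeros((descriptors.shape[0], vocabulary.shape[0]))
    f[np.arange(len(idx)), idx] = 1.0
    return f

def bag_of_words(descriptors, vocabulary):
    """Word-count vector F(Y_n): the sum of the one-hot embeddings."""
    return one_hot_embeddings(descriptors, vocabulary).sum(axis=0)

# Toy example (made-up data): M = 3 visual words, L = 4 descriptors in 2D.
vocabulary = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]])
counts = bag_of_words(descriptors, vocabulary)
print(counts)
```

For realistic vocabulary sizes, the brute-force distance matrix would likely be replaced by an approximate nearest-neighbor search; the sketch only illustrates the definition.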

In practice, however, the so-called term-frequency inverse-document-frequency weighting is applied:

$F_m(Y_n) = \frac{\sum_{l = 1}^L f_m(y_{l,n})}{\sum_{m' = 1}^M \sum_{l = 1}^L f_{m'}(y_{l,n})} \log\left(\frac{N}{\sum_{n = 1}^N \sum_{l = 1}^L f_m(y_{l,n})}\right)$

where $f_m(y_{l,n})$ and $F_m(Y_n)$ denote component $m$ of the corresponding vectors. The first factor (the term frequency) is the fraction of local descriptors in image $n$ that are assigned to visual word $\hat{y}_m$ and thus determines the importance of $\hat{y}_m$ within that image. In contrast, the second factor (the inverse document frequency) down-weights the influence of local descriptors assigned to word $\hat{y}_m$ if that word occurs frequently in the whole database.
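
The weighting can be sketched in a few lines of NumPy. The snippet follows the formula as written above, so the denominator inside the logarithm is the total number of occurrences of a word in the database. Two choices are assumptions made only for this illustration: words that never occur and images without descriptors both get weight zero. The function name and the toy counts are hypothetical.

```python
import numpy as np

def tf_idf(counts):
    """counts[n, m] = number of descriptors of image n assigned to word m."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]
    # Term frequency: fraction of an image's descriptors assigned to each word.
    per_image = counts.sum(axis=1, keepdims=True)
    tf = np.divide(counts, per_image, out=np.zeros_like(counts),
                   where=per_image > 0)
    # Inverse document frequency, using total occurrences per word.
    totals = counts.sum(axis=0)
    idf = np.zeros_like(totals)
    used = totals > 0  # assumption: unused words get weight zero
    idf[used] = np.log(N / totals[used])
    return tf * idf

# Toy database (made-up counts): N = 4 images, M = 3 visual words.
counts = np.array([[1, 1, 0],
                   [0, 2, 0],
                   [0, 1, 0],
                   [0, 0, 0]])
F = tf_idf(counts)
print(F)
```

In this toy database, the second word occurs four times across four images, so its logarithm is zero and it contributes nothing to any image, whereas the first word occurs only once and dominates the representation of the first image.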
