27^{th}NOVEMBER2015

J. Sivic, A. Zisserman. *Video google: A text retrieval approach to object matching in videos*. In Computer Vision, International Conference on, pages 1470–1477, Nice, France, October 2003.

What is **your opinion** on the summarized work? Or do you know related work that is of interest? **Let me know** your thoughts in the comments below or get in touch with me:

Sivic and Zisserman, motivated by early text-retrieval systems, present the so-called bag of visual words model for image retrieval. First, they cluster all local descriptors, denoted by $Y = \bigcup_{n = 1}^N Y_n$ where $Y_n$ are the local descriptors extracted from image $n$ (out of $N$ images), using $k$-means clustering to define a vocabulary of visual words. Subsequently, descriptors are assigned to the nearest visual word and the global image representation is a sparse vector of word counts. Let $\hat{Y} = \{\hat{y}_1,\ldots,\hat{y}_M\}$ be the extracted visual words, then each extracted descriptor $y_{l,n} \in Y_n$ is represented by a vector (the so-called embedding):

$f(y_{l,n}) = \left(\delta(NN_{\hat{Y}}(y_{l,n}) = \hat{y}_1),\ldots,\delta(NN_{\hat{Y}}(y_{l,n}) = \hat{y}_M)\right)$

where $NN_{\hat{Y}}(y_{l,n})$ denotes the nearest neighbor of $y_{l,n}$ (the $l$-th extracted descriptor in image $n$) in $\hat{Y}$. Therefore, $f_m(y_{l,n}) = 1$ if and only if $NN_{\hat{Y}}(y_{l,n}) = \hat{y}_m$. These embeddings are then aggregated in a single vector of word counts:

$F(Y_n) = \sum_{l = 1}^L f(y_{l,n})$

In practice, however, the so-called term-frequency inverse-document-frequency weighting is applied:

$F_m(Y_n) = \frac{\sum_{l = 1}^L f_m(y_{l,n})}{\sum_{m' = 1}^M \sum_{l = 1}^L f_{m'}(y_{l,n})} log\left(\frac{N}{\sum_{n = 1}^N \sum_{l = 1}^L f_m(y_{l,n})}\right)$

where $f_m(y_{l,n})$ and $F_m(Y_n)$ denote component $m$ of the corresponding vectors. The first term is the fraction of local descriptors assigned to visual word $\hat{y}_m$ and, thus, determines the importance of $\hat{y}_m$. In contrast, the second term down-weights the influence of local descriptors assigned to word $\hat{y}_m$ if it occurs frequently in the whole database.