

H. Jégou, M. Douze, C. Schmid, P. Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition, Conference on, pages 3304–3311, San Francisco, California, June 2010.

Similar to the Bag of Visual Words model [1], Jégou et al. approach the problem of image retrieval by first computing a vocabulary of $M$ visual words learned using $k$-means clustering. Instead of counting word occurrences, however, they consider the corresponding residuals:

$f(y_{l,n}) = (\delta(NN_{\hat{Y}}(y_{l,n}) = \hat{y}_1)(y_{l,n} - \hat{y}_1), \ldots, \delta(NN_{\hat{Y}}(y_{l,n}) = \hat{y}_M)(y_{l,n} - \hat{y}_M))$

where $\hat{Y} = \{\hat{y}_1, \ldots, \hat{y}_M\}$ is the set of learned visual words, $NN_{\hat{Y}}(\cdot)$ denotes nearest-neighbor assignment to a visual word, and $y_{l,n}$ is the $l$-th feature extracted from image $n$. These so-called embeddings are then aggregated using

$F(Y_n) = \sum_{l = 1}^L f(y_{l,n})$.

The image representation $F(Y_n)$ is usually $L_2$-normalized.
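The embedding and sum aggregation above can be sketched in a few lines of NumPy. This is a minimal illustration of the equations, not the authors' implementation; the function name `vlad` and the hard nearest-neighbor assignment are my own choices for the sketch:

```python
import numpy as np

def vlad(features, centers):
    """Aggregate local descriptors into a single vector (sketch of the
    equations above, not the authors' code).

    features: array of shape (L, D), the L local descriptors of one image.
    centers: array of shape (M, D), the learned visual words.
    Returns a flattened, L2-normalized vector of shape (M * D,).
    """
    M, D = centers.shape
    # Hard-assign each descriptor to its nearest visual word
    # (this realizes the delta(NN(y) = y_m) indicator).
    distances = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignments = np.argmin(distances, axis=1)

    # Sum the residuals y - y_m per visual word, i.e., F(Y_n) = sum_l f(y_{l,n}).
    F = np.zeros((M, D))
    for m in range(M):
        assigned = features[assignments == m]
        if len(assigned) > 0:
            F[m] = (assigned - centers[m]).sum(axis=0)

    # Flatten and L2-normalize the final image representation.
    F = F.reshape(-1)
    norm = np.linalg.norm(F)
    return F / norm if norm > 0 else F
```

For retrieval, images are then compared by the inner product (equivalently, Euclidean distance) between their normalized representations.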

  • [1] J. Sivic, A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Computer Vision, International Conference on, pages 1470–1477, Nice, France, October 2003.

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below!