H. Jégou, A. Zisserman. Triangulation embedding and democratic aggregation for image search. In Computer Vision and Pattern Recognition, Conference on, pages 3310–3317, Columbus, June 2014.

Jégou and Zisserman consider an image retrieval system similar to the Bag of Visual Words model [1] where local descriptors $y_{l,n}$ computed on images $x_n$, $n = 1,\ldots,N$, are quantized into visual words $\hat{Y} = \{\hat{y}_1,\ldots,\hat{y}_M\}$ (e.g. uisng $k$-means) and each image is described by

$F(Y_n) = \sum_{l = 1}^L f(y_{l,n})$ with $f(y_{l,n}) \in \mathbb{R}^M$, $f_m(y_{l,n}) = 1 \Leftrightarrow \hat{y}_m = NN_{\hat{Y}}(y_{l,n})$.

Here, $Y_n$ is the set of local descriptors for image $x_n$ and $NN_{\hat{Y}}(y_{l,n})$ represents the nearest neighbor of local descriptor $y_{l,n}$ in $\hat{Y}$.

Jégou and Zisserman note that the above aggregation function $F(Y_n)$ gives unequal weights to the different local descriptors $y_{l,n} \in Y_n$. Therefore, they consider the contribution of a single descriptor $y_{l,n}$ to the set similarity

$F(Y_n)^T F(Y_n) = \sum_{l = 1}^L \sum_{l' = 1}^L f(y_{l,n})^T f(y_{l',n})$(1)

which can be written as

$f(y_{l,n})^T \sum_{l' = 1}^L f(y_{l',n}) = \|f(y_{l,n})\|_2^2 + f(y_{l,n})^T \sum_{l \neq l'} f(y_{l',n})$.

To let each local descriptor $y_{l,n}$ contribute equally to Equation (1), Jégou and Zisserman first compute the residuals

$r_{l,m} = \frac{y_{l,n} - \hat{y}_m}{\|y_{l,n} - \hat{y}_m\|_2}$

for each local descriptor $y_{l,n}$. These residuals only contain directional information (e.g. in contrast to VLAD [2] where unnormalized residuals are used). With $r_l = (r_{l,1}^T,\ldots,r_{l,M}^T)$, the local descriptor $y_{l,n}$ is then embedded using

$f(y_{l,n}) = \Sigma(r_l)^\frac{1}{2}(r_k - \mu(r_l))$.

$\Sigma(r_l)$ and $\mu(r_l)$ are covariance and mean estimated on all $r_l$ (i.e. across all local descriptors $y_{l,n}$ for all images $n = 1,\ldots,N$). Furthermore, Jégou and Zisserman propose the concept of a democratic kernel: a kernel $k(y_{l,n}, y_{l',n})$ is called democratic if there exists a constant $\kappa$ such that

$\sum_{l' = 1}^L k(y_{l,n}, y_{l',n}) = \kappa$ $\forall 1 \leq l \leq L$.

Such a kernel can efficiently be learned using an adapted Sinkhorn algorithm [3] when considering a weighted kernel

$k'(y_{l,n}, y_{l',n}) = \lambda_l \lambda_{l'} k(y_{l,n}, y_{l',n}) = \lambda_l \lambda_{l'} y_{l,n}^Ty_{l',n}$

and solving the system of equations given by

$\lambda_l\left(\sum_{l' = 1}^L \lambda_{l'} k(y_{l,n}, y_{l',n})\right) = \kappa$ with $\lambda_{l} > 0$, $\forall 1 \leq l \leq L$.

The embeddings $f(y_{l,n})$ are then aggregated using the weighted sum

$F(Y_n) = \sum_{l = 1}^L \lambda_l f(y_{l,n})$.

  • [1] J. Sivic, A. Zisserman. Video google: A text retrieval approach to object matching in videos. In Computer Vision, International Conference on, pages 1470–1477, Nice, France, October 2003.
  • [2] H. Jégou, M. Douze, C. C. Schmid, P. Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition, Conference on, pages 3304–3311, San Fransisco, California, June 2010.
  • [3] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35(2):876-879, June 1964.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.