J. Philbin, M. Isard, J. Sivic, and A. Zisserman. Descriptor learning for efficient retrieval. In Computer Vision, European Conference on, volume 6313 of Lecture Notes in Computer Science, pages 677–691, Heraklion, Greece, September 2010. Springer.

Philbin et al. try to avoid quantization errors usually incurred by image retrieval systems based on the Visual Bag of Words model [1] (where so-called visual words are computed by quantizing a set of local descriptors extracted from a set of images using $k$-means) by learning a linear transformation $P\in\mathbb{R}^{c' \times c}$ such that matching local descriptors lie more closely together while non-matching local descriptors lie farther apart. Here, $c$ is the dimension of the extracted local descriptors and $c'$ is the desired dimension of the transformed local descriptors. Given a dataset of images annotated as matching and non-matching image pairs, a dataset of local descriptor pairs is created as follows:

  • Randomly select a pair of images, detect interest point and extract local descriptors;
  • then, apply Lowe's second-nearest neighbor ratio test (local descriptors where the fraction of distances to the first and the second nearest-neighbor is above $0.8$ are discarded where the second-nearest neighbor is constrained to come from an unrelated image);
  • and estimate an affine transformation using RANSAC.

The dataset consists of the inliers found by RANSAC as positives $Y_{\text{pos}} \subseteq \mathbb{R}^c \times \mathbb{R}^c$, the corresponding outliers as negatives $Y_{\text{out}}$ and random matches as "easy" negatives $Y_{\text{rnd}}$. Then, the transformation $P$ is learned by minimizing the following objective using stochastic gradient descent:

$E(P) = \sum_{(y_{l,n},y_{l',n}) \in Y_{\text{pos}}} \mathcal{L}(b_1 - d(Py_{l,n}, Py_{l',n})) + \sum_{(y_{l,n}, y_{l',n}) \in Y_{\text{out}}} \mathcal{L}(d(Py_{l,n}, Py_{l',n})-b_1)$
$+ \sum_{(y_{l,n}, y_{l',n}) \in Y_{\text{rnd}}} \mathcal{L}(d(Py_{l,n}, Py_{l',n}) - b_2) + \frac{\lambda}{2}\|P\|^2$(1)

where $d$ is a chosen distance function and $\mathcal{L}$ is the logistic loss:

$\mathcal{L}(z) = \log(1 + \exp(-z))$.

and $b_1$, $b_2$ are the parameters of the maximum margin approach described by Equation (1).

  • [1] J. Sivic, A. Zisserman. Video google: A text retrieval approach to object matching in videos. In Computer Vision, International Conference on, pages 1470–1477, Nice, France, October 2003.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.