JANUARY2016
J. Philbin, M. Isard, J. Sivic, and A. Zisserman. Descriptor learning for efficient retrieval. In Computer Vision, European Conference on, volume 6313 of Lecture Notes in Computer Science, pages 677–691, Heraklion, Greece, September 2010. Springer.
Philbin et al. try to avoid quantization errors usually incurred by image retrieval systems based on the Visual Bag of Words model [1] (where so-called visual words are computed by quantizing a set of local descriptors extracted from a set of images using $k$-means) by learning a linear transformation $P\in\mathbb{R}^{c' \times c}$ such that matching local descriptors lie more closely together while non-matching local descriptors lie farther apart. Here, $c$ is the dimension of the extracted local descriptors and $c'$ is the desired dimension of the transformed local descriptors. Given a dataset of images annotated as matching and non-matching image pairs, a dataset of local descriptor pairs is created as follows:
The dataset consists of the inliers found by RANSAC as positives $Y_{\text{pos}} \subseteq \mathbb{R}^c \times \mathbb{R}^c$, the corresponding outliers as negatives $Y_{\text{out}}$ and random matches as "easy" negatives $Y_{\text{rnd}}$. Then, the transformation $P$ is learned by minimizing the following objective using stochastic gradient descent:
$E(P) = \sum_{(y_{l,n},y_{l',n}) \in Y_{\text{pos}}} \mathcal{L}(b_1 - d(Py_{l,n}, Py_{l',n})) + \sum_{(y_{l,n}, y_{l',n}) \in Y_{\text{out}}} \mathcal{L}(d(Py_{l,n}, Py_{l',n})-b_1)$
$+ \sum_{(y_{l,n}, y_{l',n}) \in Y_{\text{rnd}}} \mathcal{L}(d(Py_{l,n}, Py_{l',n}) - b_2) + \frac{\lambda}{2}\|P\|^2$(1)
where $d$ is a chosen distance function and $\mathcal{L}$ is the logistic loss:
$\mathcal{L}(z) = \log(1 + \exp(-z))$.
and $b_1$, $b_2$ are the parameters of the maximum margin approach described by Equation (1).