David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004.

In 2004, Lowe introduced the most popular interest point descriptor to date -- the Scale Invariant Feature Transform (SIFT). While there are several implementations available (OpenCV's implementation, VLFeat's implementation, Vedaldi's implementation ...), Lowe holds a US patent on SIFT, meaning that SIFT is often not usable in practice. The following is an informal description of SIFT.

First, given a specific scale, the orientation of the interest point is determined by computing a gradient orientation histogram under a Gaussian window and taking the maximum bin as the orientation. In practice, $36$ bins are used to determine the orientation, and up to three orientations corresponding to bins above 80% of the maximum bin are retained as additional orientations. Then, SIFT divides a rectangular region around the interest point, rotated according to the orientation determined previously, into a $4 \times 4$ grid. For each grid element, an 8-bin gradient orientation histogram is computed using trilinear interpolation. Pixels are weighted according to gradient magnitude and a Gaussian window. This results in a $c = 4 \cdot 4 \cdot 8 = 128$-dimensional descriptor, which is $L_2$-normalized.
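The orientation assignment step can be sketched as follows; this is a simplified illustration of a magnitude- and Gaussian-weighted orientation histogram (the function name, bin mapping, and `sigma` default are illustrative assumptions, and interpolation between bins is omitted):

```python
import numpy as np

def orientation_histogram(dx, dy, num_bins=36, sigma=1.5):
    """Sketch of SIFT's orientation histogram: each pixel votes for an
    orientation bin, weighted by gradient magnitude and a Gaussian
    window centered on the patch. (Illustrative, not Lowe's exact code.)"""
    magnitude = np.sqrt(dx**2 + dy**2)
    orientation = np.arctan2(dy, dx)  # gradient direction in [-pi, pi]

    # Gaussian weighting centered on the patch
    h, w = dx.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    gauss = np.exp(-((ys - cy)**2 + (xs - cx)**2) / (2.0 * sigma**2))
    weights = magnitude * gauss

    # Map each orientation to one of num_bins bins covering [-pi, pi)
    bins = ((orientation + np.pi) / (2.0 * np.pi) * num_bins).astype(int) % num_bins
    hist = np.zeros(num_bins)
    np.add.at(hist, bins.ravel(), weights.ravel())
    return hist
```

The dominant orientation is then `hist.argmax()`; the same weighted-voting idea, with 8 bins per cell and trilinear interpolation, produces the $4 \times 4 \times 8$ descriptor.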

In image retrieval, for example, SIFT is the most commonly used local descriptor, and several aggregation techniques are tailored specifically to SIFT descriptors (for example [1]). Nevertheless, several extensions have been proposed, especially concerning normalization. For example, Arandjelović and Zisserman [2] propose RootSIFT: the original SIFT descriptor is $L_1$-normalized and the individual elements are square-rooted. As a result, the Euclidean distance on RootSIFT descriptors resembles the Bhattacharyya coefficient of the original SIFT descriptors.
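The RootSIFT transformation from [2] is simple enough to sketch directly; the function name and the small `eps` guard against division by zero are my own choices:

```python
import numpy as np

def root_sift(descriptors, eps=1e-7):
    """RootSIFT [2]: L1-normalize each descriptor, then take the
    element-wise square root. Euclidean distance on the result
    corresponds to the Hellinger kernel on the original descriptors."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    l1 = np.abs(descriptors).sum(axis=-1, keepdims=True) + eps
    return np.sqrt(descriptors / l1)
```

Note that RootSIFT descriptors are automatically $L_2$-normalized, since the squared elements of each output vector sum to one.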

  • [1] H. Jégou, M. Douze, C. Schmid, P. Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition, Conference on, pages 3304–3311, San Francisco, California, June 2010.
  • [2] R. Arandjelović, A. Zisserman. Three things everyone should know to improve object retrieval. In Computer Vision and Pattern Recognition, Conference on, pages 2911–2918, Providence, Rhode Island, June 2012.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.