In 2004, Lowe introduces to most popular interest point descriptor to date -- Scale Invariant Feature Transform (SIFT). While there are several implementations available (OpenCV's implementation, VLFeat's implementation, Vedaldi's implementation ...), Lowe holds a US patent for SIFT meaning that SIFT is often not applicable in practice. The following is an informal description of SIFT.
First, given a specific scale, the orientation of the interest point is determined by computing a gradient orientation histogram under a Gaussian window and using the maximum bin as orientation. In practice, $36$ bins are used to determine the orientation and up to three orientations corresponding to bins above 80% of the maximum bin are retained as additional orientations. Then, SIFT divides a rectangular region, rotated according to the orientation determined previously, around the interest point into a $4 \times 4$ grid. For each grid element, a 8-bin gradient orientation histogram is computed using trilinear interpolation. Pixels are weighted according to gradient magnitude and a Gaussian window. This results in a $c = 4\cdot 4\cdot 8$-dimensional descriptors which is $L_2$-normalized.
In image retrieval, for example, SIFT is the most commonly used local descriptor and several aggregation techniques are tailored specificly to SIFT descriptors (for example ). Nevertheless, several extensions have been proposed, especially concerning normalization. For example, Arandjelovic and Zisserman  ´ propose to use RootSIFT, that is the original SIFT descriptor is L1-normalized and the individual elements are square-rooted. As result, the Euclidean distance on RootSIFT resembles the Bhattacharyya coefficient of the original SIFT descriptors.
What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or using the following platforms: