IAM

OPENSOURCEFAN STUDYING
STUDYINGCOMPUTERSCIENCEANDMATH COMPUTERSCIENCE

DAVIDSTUTZ

Check out the latest superpixel benchmark — Superpixel Benchmark (2016) — and let me know your opinion! @david_stutz
25thOCTOBER2015

READING

J. A. Lee, M. Verleysen. Nonlinear dimensionality reduction. Springer , New York; London, 2007.

Another part of the "Data Mining Algorithms 2" lecture at RWTH Aachen University was high-dimensional clustering. Before introducing nonlinear dimensionality reduction techniques, Lee and Verleysen introduce several problems arising in high dimensionality:

  • Empty space phenomenon. Considering a Cartesian grid on the unit cube with fixed spacing, the number of grid points grows exponentially with the dimension $d$. Thus, for a fixed number of data points and high dimensionality, most grid cells will be empty.
  • Hypervolume of cubes and spheres. The hypervolume of a cube and a sphere of unit length (where the cube contains the sphere) in $d$ dimensions is given as:

    $V_{sphere} = \frac{(\sqrt{\pi} r)^d}{\Gamma(1 + \frac{d}{2})}$,

    $V_{cube} = (2r)^d$.

    Both, the ratio $\frac{V_{sphere}}{V_{cube}}$ and the hypervolume $V_{sphere}$ of the sphere tend towards zero with increasing dimensionality:

    $\lim_{d \rightarrow \infty} V_{sphere} = 0$ and $\lim_{d \rightarrow \infty} \frac{V_{sphere}}{V_{cube}} = 0$

  • Hypervolume of a thin spherical shell. The volume of a spherical shell tends towards 1, i.e. in high dimensions most of the volume of a sphere is contained in its shell.
  • Concentration of norms. The norm of random vectors grows proportionally to $\sqrt{D}$ while the variance stays constant. In particular, if $y$ is a $d$-dimensional vector where all components are independent and identically distributed (with a finite eigth order moment), the mean $\mu_{\|y\|}$ and the variance $\sigma^2_{\|y\|}$ of the Euclidean norm are

    $\mu_{\|y\|} = \sqrt{ad - b} + \mathcal{O}(\frac{1}{d})$,

    $\sigma^2_{\|y\|} = b + \mathcal{O}(\frac{1}{\sqrt{d}})$.

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or using the following platforms: