J. A. Lee, M. Verleysen. Nonlinear Dimensionality Reduction. Springer, New York; London, 2007.

Another part of the "Data Mining Algorithms 2" lecture at RWTH Aachen University was high-dimensional clustering. Before presenting nonlinear dimensionality reduction techniques, Lee and Verleysen discuss several problems arising in high dimensions:

  • Empty space phenomenon. For a Cartesian grid on the unit cube with fixed spacing, the number of grid points grows exponentially with the dimension $d$. Thus, for a fixed number of data points, most grid cells will be empty in high dimensions (see the first sketch after this list).
  • Hypervolume of cubes and spheres. The hypervolumes of a sphere with radius $r$ and of the enclosing cube with edge length $2r$ in $d$ dimensions are given by:

    $V_{sphere} = \frac{(\sqrt{\pi} r)^d}{\Gamma(1 + \frac{d}{2})}$,

    $V_{cube} = (2r)^d$.

    Both the hypervolume $V_{sphere}$ of the sphere (for fixed $r$) and the ratio $\frac{V_{sphere}}{V_{cube}}$ tend towards zero with increasing dimensionality:

    $\lim_{d \rightarrow \infty} V_{sphere} = 0$ and $\lim_{d \rightarrow \infty} \frac{V_{sphere}}{V_{cube}} = 0$

  • Hypervolume of a thin spherical shell. The fraction of a sphere's volume contained in a thin shell near its surface tends towards 1 with increasing dimensionality: for a shell of relative thickness $\epsilon$, the fraction is $1 - (1 - \epsilon)^d \rightarrow 1$ as $d \rightarrow \infty$, i.e. in high dimensions most of the volume of a sphere is contained in its shell.
  • Concentration of norms. The norm of a random vector grows proportionally to $\sqrt{d}$ while its variance stays asymptotically constant. In particular, if $y$ is a $d$-dimensional random vector whose components are independent and identically distributed (with a finite eighth-order moment), the mean $\mu_{\|y\|}$ and the variance $\sigma^2_{\|y\|}$ of the Euclidean norm are

    $\mu_{\|y\|} = \sqrt{ad - b} + \mathcal{O}(\frac{1}{d})$,

    $\sigma^2_{\|y\|} = b + \mathcal{O}(\frac{1}{\sqrt{d}})$,

    where $a$ and $b$ are constants that depend only on the distribution of the components (see the second sketch after this list).
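
As a quick numerical illustration of the first three phenomena (my own sketch, not taken from the book), the following Python snippet evaluates the formulas above: it counts the points of a Cartesian grid with $k$ points per axis, computes $V_{sphere}$ and $V_{cube}$ for $r = 1$, and the fraction of the sphere's volume contained in a shell of relative thickness $\epsilon$. The values $k = 10$ and $\epsilon = 0.05$ are arbitrary choices.

```python
import math

# Sketch of the empty space phenomenon, the sphere/cube hypervolumes,
# and the thin shell fraction. k and eps are arbitrary illustrative
# choices, not values from the book.
k = 10       # grid points per axis
eps = 0.05   # relative thickness of the spherical shell

for d in [1, 2, 5, 10, 20, 50]:
    # Empty space phenomenon: a Cartesian grid with k points per axis
    # has k^d points, i.e. it grows exponentially with d.
    grid_points = k ** d

    # Hypervolume of the sphere (r = 1) and of the enclosing cube with
    # edge length 2r, using the formulas above.
    v_sphere = math.pi ** (d / 2) / math.gamma(1 + d / 2)
    v_cube = 2.0 ** d

    # Fraction of the sphere's volume inside a shell of relative
    # thickness eps: 1 - (1 - eps)^d.
    shell_fraction = 1.0 - (1.0 - eps) ** d

    print(f"d={d:3d}  grid points={grid_points:.1e}  "
          f"V_sphere={v_sphere:.3e}  V_sphere/V_cube={v_sphere / v_cube:.3e}  "
          f"shell fraction={shell_fraction:.3f}")
```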
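
Similarly, the concentration of norms can be checked with a small Monte Carlo sketch (again my own illustration, using NumPy): the mean of $\|y\|$ grows roughly like $\sqrt{d}$ while its variance levels off at a constant. The sample size, the random seed, and the uniform component distribution are arbitrary choices; any distribution with a finite eighth-order moment behaves similarly.

```python
import numpy as np

# Monte Carlo sketch of the concentration of norms.
rng = np.random.default_rng(0)
n = 10000  # number of random vectors per dimension

for d in [1, 10, 50, 100, 500, 1000]:
    y = rng.uniform(0.0, 1.0, size=(n, d))  # i.i.d. components
    norms = np.linalg.norm(y, axis=1)       # Euclidean norms

    # The mean grows like sqrt(a * d) while the variance levels off at
    # a constant b, so the relative spread of the norms vanishes.
    print(f"d={d:4d}  mean={norms.mean():8.3f}  "
          f"var={norms.var():.4f}  mean/sqrt(d)={norms.mean() / np.sqrt(d):.4f}")
```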

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.