C. Elkan. Using the Triangle Inequality to Accelerate k-Means. International Conference on Machine Learning, 2003.

There are several drawbacks in using $k$-means clustering, among others the sensitivity to outliers, the problem of how to choose $k$ or how to initialize the $k$ centers [1]. Although the problem of initializing the centers is a well-studied problem (e.g. using $k$-means++ [2]), in practice, $k$-means is often run multiple times with different initial centers to get good results. For such approaches, the runtime of $k$-means - especially in high dimensions with a large set of samples - is crucial. Elkan proposes to use the triangle inequality in order to reduce the number of distance calculations and, thus, reduce runtime. His algorithm is based on the following lemmata:

Lemma 1: Let $d$ be a distance function. Further, let $x$ be a data point and $c, c'$ two centers. Then it holds:

if $d(c, c') \geq 2 d(x, c')$, then $d(x,c) \geq d(x, c')$.(1)

Lemma 2: Let $d$ be a distance function, $x$ be a data point and $c, c'$ centers. Then:

$d(x,c) \geq \max\{0, d(x, c') - c(c, c')\}$.

In practice, Lemma 1 is used to reduce the number of distance calculations by computing the distances $d(c, c')$ and applying equation (1). Lemma 2 is used to maintain a lower bound on the distances of each point $x$ to each center $c$, see the paper for details.

  • [1] J. Han, M. Kamber, J.Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc. San Francisco, CA, 2005.
  • [2] D. Arthur, S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2007.

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: