C. Elkan. Using the Triangle Inequality to Accelerate k-Means. International Conference on Machine Learning, 2003.

There are several drawbacks in using $k$-means clustering, among others the sensitivity to outliers, the problem of how to choose $k$ or how to initialize the $k$ centers [1]. Although the problem of initializing the centers is a well-studied problem (e.g. using $k$-means++ [2]), in practice, $k$-means is often run multiple times with different initial centers to get good results. For such approaches, the runtime of $k$-means - especially in high dimensions with a large set of samples - is crucial. Elkan proposes to use the triangle inequality in order to reduce the number of distance calculations and, thus, reduce runtime. His algorithm is based on the following lemmata:

Lemma 1: Let $d$ be a distance function. Further, let $x$ be a data point and $c, c'$ two centers. Then it holds:

if $d(c, c') \geq 2 d(x, c')$, then $d(x,c) \geq d(x, c')$.(1)

Lemma 2: Let $d$ be a distance function, $x$ be a data point and $c, c'$ centers. Then:

$d(x,c) \geq \max\{0, d(x, c') - c(c, c')\}$.

In practice, Lemma 1 is used to reduce the number of distance calculations by computing the distances $d(c, c')$ and applying equation (1). Lemma 2 is used to maintain a lower bound on the distances of each point $x$ to each center $c$, see the paper for details.

  • [1] J. Han, M. Kamber, J.Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc. San Francisco, CA, 2005.
  • [2] D. Arthur, S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2007.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.