Concepedia

TLDR

k‑means is a long‑standing, widely used clustering algorithm whose performance depends heavily on initialization. Although k‑means++ improves initialization, its sequential nature limits scalability, and prior parallel efforts have focused mainly on post‑initialization phases. This work aims to reduce the number of passes required to initialize k‑means in parallel. The authors propose k‑means||, a parallel algorithm that constructs an initial set of centers by sampling points over multiple rounds, requiring only a logarithmic number of passes. They prove that k‑means|| achieves a near‑optimal initialization after logarithmically many passes and demonstrate experimentally that it outperforms k‑means++ in both sequential and parallel settings on large‑scale data.

Abstract

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
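To make the idea concrete, here is a minimal, sequential sketch of the k-means|| initialization described in the abstract: rather than picking one center per pass as k-means++ does, each round independently oversamples points with probability proportional to their squared distance to the current centers, and after a logarithmic number of rounds the (small) candidate set is reduced to k centers via weighted k-means++. The function names, the oversampling factor `ell`, and the round count heuristic are illustrative assumptions, not the paper's exact pseudocode.

```python
import math
import random

def d2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(d2(p, c) for c in centers) for p in points)

def weighted_kmeans_pp(points, weights, k, rng):
    """Weighted k-means++: reduces the oversampled candidate set to k centers."""
    centers = [rng.choices(points, weights=weights)[0]]
    while len(centers) < k:
        # sample proportionally to weight times squared distance to nearest center
        probs = [w * min(d2(p, c) for c in centers)
                 for p, w in zip(points, weights)]
        if sum(probs) == 0:  # all remaining candidates coincide with a center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=probs)[0])
    return centers

def kmeans_parallel_init(points, k, ell=None, rng=None):
    """Illustrative k-means|| initialization (sequential simulation).

    ell is the oversampling factor (expected points sampled per round);
    the paper suggests it be on the order of k.
    """
    rng = rng or random.Random(0)
    ell = ell or 2 * k
    centers = [rng.choice(points)]  # seed with one uniformly random point
    rounds = max(1, round(math.log(cost(points, centers) + 2)))  # ~O(log cost)
    for _ in range(rounds):
        c = cost(points, centers)
        if c == 0:
            break
        # each point joins the candidate set independently,
        # with probability min(1, ell * d^2 / total cost)
        centers += [p for p in points
                    if rng.random() < min(1.0, ell * min(d2(p, q) for q in centers) / c)]
    # weight each candidate by how many points it is closest to,
    # then reduce the candidates to exactly k centers
    weights = [0] * len(centers)
    for p in points:
        j = min(range(len(centers)), key=lambda i: d2(p, centers[i]))
        weights[j] += 1
    return weighted_kmeans_pp(centers, weights, k, rng)
```

In the actual algorithm each sampling round is a single parallel pass over the data (e.g. one MapReduce job), which is why only a logarithmic, and in practice constant, number of passes is needed.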
