Web-scale k-means clustering

TLDR

The paper proposes two k‑means modifications to meet latency, scalability, and sparsity demands of web applications. The authors use mini‑batch optimization and projected gradient descent with a fast ε‑accurate L1‑ball projection to achieve scalability and sparsity. Mini‑batch optimization cuts computation cost by orders of magnitude and outperforms online stochastic gradient descent. Source code is available at http://code.google.com/p/sofia-ml.

Abstract

We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast ε-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml

References

Page 1

	Year	Citations

Page 1