Learning the k in k-means

TLDR

Choosing the number of clusters k in clustering is often unclear, making automatic selection a difficult algorithmic problem. This paper proposes an improved algorithm for automatically determining k while clustering. The G‑means algorithm iteratively runs k‑means with increasing k, applying a Gaussianity test at each step and stopping when all clusters satisfy the test, requiring only a significance level α and no full covariance matrix. Experiments show G‑means performs well and outperforms a recent BIC‑penalty method, indicating that BIC does not adequately penalize model complexity.

Abstract

When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model's complexity.

References

Page 1

	Year	Citations

Page 1