<i>K</i>-means clustering via principal component analysis

TLDR

Principal component analysis is a widely used technique for unsupervised dimension reduction, while K‑means clustering is a common method for unsupervised data clustering. The study proves that principal components are the continuous solutions to the discrete cluster membership indicators for K‑means clustering. We derive new lower bounds for the K‑means objective function as the total variance minus the eigenvalues of the data covariance matrix and illustrate the results on DNA gene expression and Internet newsgroups data. The results show that unsupervised dimension reduction is closely related to unsupervised learning, provide new insights into PCA’s effectiveness beyond noise reduction, suggest effective K‑means clustering techniques, and demonstrate that the new bounds are within 0.5–1.5% of optimal values.

Abstract

Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering for performing unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. New lower bounds for K-means objective function are derived, which is the total variance minus the eigenvalues of the data covariance matrix. These results indicate that unsupervised dimension reduction is closely related to unsupervised learning. Several implications are discussed. On dimension reduction, the result provides new insights to the observed effectiveness of PCA-based data reductions, beyond the conventional noise-reduction explanation that PCA, via singular value decomposition, provides the best low-dimensional linear approximation of the data. On learning, the result suggests effective techniques for K-means data clustering. DNA gene expression and Internet newsgroups are analyzed to illustrate our results. Experiments indicate that the new bounds are within 0.5-1.5% of the optimal values.

References

Page 1

	Year	Citations

Page 1