Clustering for metric and nonmetric distance measures

Abstract

We study a generalization of the k-median problem with respect to an arbitrary dissimilarity measure D. Given a finite set P of size n, our goal is to find a set C of size k such that the sum of errors D(P,C) = ∑_{p∈P} min_{c∈C} D(p,c) is minimized. The main result in this article can be stated as follows: There exists a (1+ϵ)-approximation algorithm for the k-median problem with respect to D, if the 1-median problem can be approximated within a factor of (1+ϵ) by taking a random sample of constant size and solving the 1-median problem on the sample exactly. This algorithm requires time n · 2^{O(mk log(mk/ϵ))}, where m is a constant that depends only on ϵ and D. Using this characterization, we obtain the first linear time (1+ϵ)-approximation algorithms for the k-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence (relative entropy), for the Itakura-Saito divergence, for Mahalanobis distances, and for some special cases of Bregman divergences. Moreover, we obtain previously known results for the Euclidean k-median problem and the Euclidean k-means problem in a simplified manner. Our results are based on a new analysis of an algorithm of Kumar et al. [2004].
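To make the objective concrete, the following Python sketch evaluates the sum of errors D(P,C) for a pluggable dissimilarity measure and illustrates the sampling idea for the 1-median problem. It is an illustration under stated assumptions, not the algorithm analyzed in the article: all function names are hypothetical, the sample size m is left to the caller, and the sampled 1-median is shown only for the squared Euclidean distance (the k-means case), where the exact 1-median of a sample is its centroid.

```python
import math
import random
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]
Dissimilarity = Callable[[Point, Point], float]


def squared_euclidean(p: Point, c: Point) -> float:
    """Squared Euclidean distance, the dissimilarity behind k-means."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def kl_divergence(p: Point, c: Point) -> float:
    """Kullback-Leibler divergence D(p || c).

    Assumption: p and c are probability vectors with strictly positive
    entries. Note that this measure is not symmetric.
    """
    return sum(pi * math.log(pi / ci) for pi, ci in zip(p, c))


def clustering_cost(points: Sequence[Point], centers: Sequence[Point],
                    d: Dissimilarity) -> float:
    """Sum of errors: each point pays d(p, c) to its cheapest center c."""
    return sum(min(d(p, c) for c in centers) for p in points)


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean; the exact 1-median for squared Euclidean."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dim))


def sampled_one_median(points: Sequence[Point], m: int,
                       rng: random.Random) -> Point:
    """Approximate the 1-median from a uniform sample of size m.

    Hypothetical helper: draws m points uniformly with replacement and
    solves the 1-median problem on the sample exactly (here, by taking
    the centroid, which is only valid for squared Euclidean distance).
    """
    sample = [rng.choice(points) for _ in range(m)]
    return centroid(sample)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
    center = sampled_one_median(pts, m=3, rng=random.Random(0))
    print(clustering_cost(pts, [center], squared_euclidean))
```

Passing the dissimilarity as a function keeps the cost evaluation independent of whether D is a metric, which mirrors the article's setting of an arbitrary measure D.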
