Publication | Closed Access
Affinity Clustering: Hierarchical Clustering at Scale
69
Citations
0
References
2017
Year
Cluster ComputingEngineeringMachine-learning PipelinesNetwork AnalysisMap-reduceUnsupervised Machine LearningGraph ProcessingData ScienceData MiningParallel ComputingDocument ClusteringKnowledge DiscoveryComputer ScienceAffinity ClusteringGraph ClusteringGraph TheoryBusinessParallel ProgrammingGraph AnalysisDistributed Hash TablesMassive Data ProcessingBig Data
Graph clustering is a fundamental task in many data-mining and machine-learning pipelines. In particular, identifying a good hierarchical structure is at the same time a fundamental and challenging problem for several applications. The amount of data to analyze is increasing at an astonishing rate each day. Hence there is a need for new solutions to efficiently compute effective hierarchical clusterings on such huge data. The main focus of this paper is on minimum spanning tree (MST) based clusterings. In particular, we propose affinity, a novel hierarchical clustering based on Boruvka's MST algorithm. We prove certain theoretical guarantees for affinity (as well as some other classic algorithms) and show that in practice it is superior to several other state-of-the-art clustering algorithms. Furthermore, we present two MapReduce implementations for affinity. The first one works for the case where the input graph is dense and takes constant rounds. It is based on a Massively Parallel MST algorithm for dense graphs that improves upon the state-of-the-art algorithm of Lattanzi et al. (SPAA 2011). Our second algorithm has no assumption on the density of the input graph and finds the affinity clustering in $O(\log n)$ rounds using Distributed Hash Tables (DHTs). We show experimentally that our algorithms are scalable for huge data sets, e.g., for graphs with trillions of edges.