K-Means Clustering Over a Large, Dynamic Network

Abstract

This paper presents an algorithm for K-means clustering of data distributed over a large, dynamic network. The network is not assumed to contain any special server nodes (a peer-to-peer network) and is not assumed to be stable either with respect to the topology or the data held by nodes. The algorithm requires only local communication and synchronization at each iteration: nodes communicate and synchronize only with their topologically neighboring nodes. Due to the growing prevalence of peer-to-peer and mobile/wireless sensor networks, data analysis in large, dynamic networks is likely to garner increasing importance in the near future. To our knowledge, our algorithm represents the first K-means algorithm (a common data analysis/mining technique) to be developed for a large dynamic network. We tested our algorithm in a simulated environment of up to 1000 nodes on synthetic data. We examine its behavior in a static environment (no data or network change) and a dynamic environment. Empirical results show the algorithm demonstrates good accuracy (in both the static and dynamic environment) in that the cluster labels produced are very similar to those produced by K-means run on centralized data.