Efficient Anytime Density-based Clustering

Abstract

Many clustering algorithms suffer from scalability problems on massive datasets and do not support any user interaction during runtime. To tackle these problems, anytime clustering algorithms are proposed. They produce a fast approximate result which is continuously refined during the further run. Also, they can be stopped or suspended anytime and provide an answer. In this paper, we propose a novel anytime clustering algorithm based on the density-based clustering paradigm. Our algorithm called A-DBSCAN is applicable to very high dimensional databases such as time series, trajectory, medical data, etc. The general idea of our algorithm is to use a sequence of lower-bounding functions (LBs) of the true similarity measure to produce multiple approximate results of the true density-based clusters. A-DBSCAN operates in multiple levels w.r.t. the LBs and is mainly based on two algorithmic schemes: (1) an efficient distance upgrade scheme which restricts distance calculations to core-objects at each level of the LBs; (2) a local re-clustering scheme which restricts update operations to the relevant objects only. Extensive experiments demonstrate that A-DBSCAN acquires very good clustering results at very early stages of execution thus saves a large amount of computational time. Even if it runs to the end, A-DBSCAN is still orders of magnitude faster than DBSCAN.