Publication | Closed Access

Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data

Citations: 704

References: 29

Year: 2003

Abstract

Finding clusters in data, especially high dimensional data, is challenging when the clusters are of widely differing shapes, sizes, and densities, and when the data contains noise and outliers. We present a novel clustering technique that addresses these issues. Our algorithm first finds the nearest neighbors of each data point and then redefines the similarity between pairs of points in terms of how many nearest neighbors the two points share. Using this definition of similarity, our algorithm identifies core points and then builds clusters around the core points. The use of a shared nearest neighbor definition of similarity alleviates problems with varying densities and high dimensionality, while the use of core points handles problems with shape and size. While our algorithm can find the “dense” clusters that other clustering algorithms find, it also finds clusters that these approaches overlook, i.e., clusters of low or medium density which represent relatively uniform regions “surrounded” by non-uniform or higher density areas. We experimentally show that our algorithm performs better than traditional methods (e.g., K-means, DBSCAN, CURE) on a variety of data sets: KDD Cup ’99 network intrusion data, NASA Earth science time series data, and two-dimensional point sets. The run-time complexity of our technique is O(n²) if the similarity matrix has to be constructed. However, we discuss a number of optimizations that allow the algorithm to handle large data sets efficiently.
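The steps the abstract describes (k nearest neighbors, shared-nearest-neighbor similarity, core-point identification, cluster growth) can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: the function name `snn_clusters` and the parameters `k`, `core_threshold`, and `sim_threshold` are assumptions chosen for readability, and a real implementation would use an indexing structure rather than the O(n²) distance matrix built here.

```python
import numpy as np

def snn_clusters(X, k=7, core_threshold=4, sim_threshold=3):
    """Toy shared-nearest-neighbor clustering sketch.

    Parameter names and default thresholds are illustrative only;
    they are not the settings used in the paper.
    """
    n = len(X)

    # Step 1: pairwise distances, then each point's k nearest neighbors.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = [set(np.argsort(d[i])[:k]) for i in range(n)]

    # Step 2: redefine similarity as the number of shared nearest
    # neighbors, counted only for mutual k-nearest-neighbor pairs.
    snn = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if i in knn[j] and j in knn[i]:
                snn[i, j] = snn[j, i] = len(knn[i] & knn[j])

    # Step 3: core points are those with enough sufficiently
    # SNN-similar neighbors (an SNN-density estimate).
    density = (snn >= sim_threshold).sum(axis=1)
    core = density >= core_threshold

    # Step 4: grow clusters outward from core points; points never
    # reached from a core point are labeled noise (-1).
    labels = -np.ones(n, dtype=int)
    cluster = 0
    for i in np.flatnonzero(core):
        if labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in np.flatnonzero(snn[p] >= sim_threshold):
                if labels[q] == -1:
                    labels[q] = cluster
                    # Only core points expand the cluster further;
                    # border points just receive the label.
                    if core[q]:
                        stack.append(q)
        cluster += 1
    return labels
```

On two well-separated blobs, points within a blob share most of their neighbor lists (high SNN similarity) while cross-blob pairs share none, so each blob becomes its own cluster regardless of the blobs' relative densities, which is the property the abstract highlights.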
