Adaptive Grids for Clustering Massive Data Sets

Abstract

Clustering is a key data mining problem. Density- and grid-based techniques are a popular way to mine clusters in a large multi-dimensional space, where clusters are regarded as regions denser than their surroundings. A cluster is characterized by its attributes and the value ranges of those attributes. Fine grid sizes lead to a huge amount of computation, while coarse grid sizes reduce the quality of the clusters found. Moreover, different grid sizes yield clusters with different descriptions. The technique of adaptive grids makes it possible to use grids derived from the data distribution and does not require the user to specify parameters such as the grid size or the density thresholds. Further, clusters may be embedded in a subspace of a high-dimensional space. We propose a modified bottom-up subspace clustering algorithm to discover clusters in all possible subspaces. Our method scales linearly with the data dimensionality and the size of the data set. Experimental results on a wide variety of synthetic and real data sets demonstrate the effectiveness of adaptive grids and the effect of the modified subspace clustering algorithm. Our algorithm explores at least an order of magnitude more subspaces than the original algorithm, and the use of adaptive grids yields, on average, a speedup of two orders of magnitude compared to the method with a user-specified grid size and threshold.
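
To make the idea of a grid derived from the data distribution concrete, the following is a minimal sketch for a single dimension. It is not the algorithm proposed here: the abstract does not give the procedure, so the merging rule, the function name adaptive_bins, and the parameters fine_bins and merge_tolerance are all assumptions made for illustration. The sketch builds a fine uniform histogram and then merges adjacent bins whose counts are similar, so that bin boundaries fall where the density changes.

```python
def adaptive_bins(values, fine_bins=20, merge_tolerance=0.2):
    """Return (low, high, count) windows for one dimension.

    Hypothetical illustration: fine_bins and merge_tolerance are
    assumed knobs for this sketch, not parameters from the paper.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate dimension: every point has the same value.
        return [(lo, hi, len(values))]

    # Step 1: fine uniform histogram over the observed range.
    width = (hi - lo) / fine_bins
    counts = [0] * fine_bins
    for v in values:
        idx = min(int((v - lo) / width), fine_bins - 1)
        counts[idx] += 1

    # Step 2: merge adjacent fine bins while their counts stay close
    # to the running mean count of the current window.
    windows = []
    start, total, n = 0, counts[0], 1
    for i in range(1, fine_bins):
        mean = total / n
        if abs(counts[i] - mean) <= merge_tolerance * max(mean, counts[i], 1):
            total += counts[i]
            n += 1
        else:
            windows.append((lo + start * width, lo + i * width, total))
            start, total, n = i, counts[i], 1
    windows.append((lo + start * width, hi, total))
    return windows


if __name__ == "__main__":
    # A dense group near 5.0 on top of sparse background points.
    data = [0.0, 2.5, 7.5, 10.0] + [5.0 + 0.01 * k for k in range(40)]
    for low, high, count in adaptive_bins(data):
        print(f"[{low:.2f}, {high:.2f}) -> {count} points")
```

In this sketch a region of roughly constant density collapses into one wide window, while a sharp change in density produces a boundary, so the number of cells depends on the data rather than on a fixed grid size.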
