
Publication | Closed Access

An analysis of document clustering algorithms

Citations: 11

References: 6

Year: 2010

Abstract

Document clustering organizes documents into groups such that each group contains documents with similar content. This paper presents the results of an experimental study of some common document clustering techniques. In particular, it compares the Euclidean K-means (K-means), Spherical K-means (SK-means), and unsupervised Principal Direction Divisive Partitioning (PDDP) algorithms. The comparison uses two evaluation measures, entropy and F-measure, and the experiments were conducted on a standard dataset. K-means and SK-means are easy to implement, but their results depend strongly on their initialization. PDDP is comparatively difficult to implement because it is a hierarchical algorithm; on the other hand, its performance does not depend on initial clusters. The results indicate that, for certain initial clusters, K-means and SK-means performed better than PDDP. When all classes contained equal numbers of documents, the algorithms produced more effective clusters than when class sizes differed. Also, without stop-word removal, the quality of PDDP degraded relative to K-means.
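The abstract names entropy and F-measure as the evaluation measures but does not give their formulas. The following is a minimal sketch of how these two measures are commonly defined for clusterings with known class labels; it is not taken from the paper. It assumes size-weighted cluster entropy with base-2 logarithms and no normalization, and a class-weighted F-measure that takes the best-matching cluster for each class. The function names `clustering_entropy` and `clustering_f_measure` are hypothetical.

```python
import math
from collections import Counter


def clustering_entropy(labels, clusters):
    """Size-weighted average of per-cluster class entropy (lower is better).

    Assumption: base-2 logarithm, no normalization by the number of classes.
    """
    n = len(labels)
    pair_counts = Counter(zip(labels, clusters))  # documents per (class, cluster)
    cluster_sizes = Counter(clusters)
    total = 0.0
    for cluster, size in cluster_sizes.items():
        h = 0.0
        for (_, c), count in pair_counts.items():
            if c == cluster:
                p = count / size  # share of this class inside the cluster
                h -= p * math.log2(p)
        total += (size / n) * h
    return total


def clustering_f_measure(labels, clusters):
    """Class-weighted F-measure using the best cluster per class (higher is better)."""
    n = len(labels)
    pair_counts = Counter(zip(labels, clusters))
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    total = 0.0
    for label, class_size in class_sizes.items():
        best = 0.0
        for cluster, cluster_size in cluster_sizes.items():
            n_ij = pair_counts.get((label, cluster), 0)
            if n_ij == 0:
                continue  # no overlap, so F for this pair is 0
            precision = n_ij / cluster_size
            recall = n_ij / class_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (class_size / n) * best
    return total


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not the paper's dataset).
    true_classes = ["a", "a", "b", "b"]
    print(clustering_entropy(true_classes, [0, 0, 1, 1]))
    print(clustering_f_measure(true_classes, [0, 0, 1, 1]))
```

Under these definitions, a clustering that reproduces the classes exactly has entropy 0 and F-measure 1, while clusters that mix classes evenly push entropy up and F-measure down.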
