Abstract

Document clustering is an important tool for text analysis and is used in many applications. This work develops a novel hierarchical algorithm for document clustering. We are particularly interested in studying the cluster-overlap phenomenon and using it to design cluster-merging criteria. Our previous papers reported theoretical results on the overlap rate between clusters under the Gaussian mixture model. In this paper, we propose a new way to compute the overlap rate that improves both time efficiency and accuracy: instead of the ridge curve, we use the straight line passing through the centers of the two clusters. Within a hierarchical clustering framework, we use the expectation-maximization (EM) algorithm to estimate the parameters of the Gaussian mixture model, and at each step we merge the two sub-clusters whose overlap is the largest. Experiments on both public datasets and document clustering data show that this approach improves clustering efficiency and reduces computing time.
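The merging criterion described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the overlap rate is taken to be the ratio of the mixture density at the valley between two components to the density at the lower of the two peaks, both measured only along the straight segment joining the two centers; the component weights, means, and covariances are assumed to have been estimated by EM already; and the function names (`gaussian_pdf`, `overlap_rate`, `most_overlapping_pair`) and the sampling resolution `n_points` are hypothetical.

```python
import itertools
import numpy as np


def gaussian_pdf(points, mean, cov):
    """Multivariate normal density evaluated at each row of `points`."""
    dim = mean.shape[0]
    diff = points - mean
    # Squared Mahalanobis distance of every point to the mean.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    log_norm = -0.5 * (dim * np.log(2.0 * np.pi) + np.linalg.slogdet(cov)[1])
    return np.exp(log_norm - 0.5 * maha)


def overlap_rate(w1, m1, c1, w2, m2, c2, n_points=201):
    """Assumed overlap rate: valley density / lower peak density,
    measured on the segment between the two centers (not the ridge curve)."""
    t = np.linspace(0.0, 1.0, n_points)
    segment = m1 + t[:, None] * (m2 - m1)
    dens = w1 * gaussian_pdf(segment, m1, c1) + w2 * gaussian_pdf(segment, m2, c2)

    # Look for an interior local minimum (a valley) along the segment.
    inner = np.arange(1, n_points - 1)
    is_min = (dens[inner] < dens[inner - 1]) & (dens[inner] <= dens[inner + 1])
    if not is_min.any():
        return 1.0  # no valley: the pair looks like a single cluster

    candidates = inner[is_min]
    valley = candidates[np.argmin(dens[candidates])]
    lower_peak = min(dens[:valley].max(), dens[valley + 1:].max())
    return float(dens[valley] / lower_peak)


def most_overlapping_pair(weights, means, covs):
    """Return ((i, j), rate) for the pair of components with the largest overlap."""
    best_pair, best_rate = None, -1.0
    for i, j in itertools.combinations(range(len(weights)), 2):
        rate = overlap_rate(weights[i], means[i], covs[i],
                            weights[j], means[j], covs[j])
        if rate > best_rate:
            best_pair, best_rate = (i, j), rate
    return best_pair, best_rate


# Toy parameters standing in for EM estimates: two nearby components, one far away.
weights = [1 / 3, 1 / 3, 1 / 3]
means = [np.array([0.0, 0.0]), np.array([1.5, 0.0]), np.array([10.0, 10.0])]
covs = [np.eye(2), np.eye(2), np.eye(2)]
pair, rate = most_overlapping_pair(weights, means, covs)
```

In a hierarchical procedure, the selected pair would be merged, the parameters re-estimated, and the search repeated until a stopping rule is met. Restricting the search to a one-dimensional segment is what makes this cheaper than tracing a ridge curve, at the cost of possibly missing a valley that lies off the segment.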
