Publication | Closed Access
Hierarchical Document Clustering Using Frequent Itemsets
Citations: 466
References: 45
Year: 2003
Unknown Venue
Natural Language Processing, Cluster Computing, Frequent Items, Document Clustering, Engineering, Frequent Pattern Mining, Information Retrieval, Data Mining, Data Science, Association Rule, Topic Model, Knowledge Discovery, Pattern Mining, Structure Mining, Frequent Itemsets, Statistics, Corpus Linguistics, Text Mining
Document clustering must cope with extremely high dimensionality: a vocabulary can run to thousands of words, yet each document contains only a small fraction of them. Clusters also need to be hierarchical so that documents can be browsed by increasing topic specificity. This paper proposes clustering documents with frequent itemsets, a notion from association rule mining. Each cluster is defined by the frequent itemsets its documents share, and the same itemsets generate a hierarchical topic tree. Focusing on frequent items drastically reduces dimensionality, and the authors report that the method outperforms the best existing approaches in clustering accuracy and scalability.
A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains only a small fraction of the words in the vocabulary. These features require special handling. Another requirement is hierarchical clustering, where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms the best existing methods in terms of both clustering accuracy and scalability.
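The intuition in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it only mines word sets shared by enough documents, treats each such set as the label of a candidate cluster, and links each itemset to its frequent subsets as a rough stand-in for a topic hierarchy. The function names, the `min_count` threshold, the `max_size` limit, and the toy documents are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(docs, min_count, max_size=2):
    """Map each frequent word set to the indices of the documents containing it.

    Assumption: a word set is "frequent" if at least `min_count` documents
    contain every word in it. This is an illustrative threshold only.
    """
    word_sets = [set(d.lower().split()) for d in docs]

    # Record which documents contain each single word.
    cover = defaultdict(set)
    for i, words in enumerate(word_sets):
        for w in words:
            cover[frozenset([w])].add(i)

    current = {s: ids for s, ids in cover.items() if len(ids) >= min_count}
    result = dict(current)

    # Grow itemsets one word at a time, using only words that are still frequent.
    for size in range(2, max_size + 1):
        items = sorted({w for s in current for w in s})
        nxt = {}
        for combo in combinations(items, size):
            # Documents containing all words = intersection of single-word covers.
            ids = set.intersection(*(cover[frozenset([w])] for w in combo))
            if len(ids) >= min_count:
                nxt[frozenset(combo)] = ids
        if not nxt:
            break
        result.update(nxt)
        current = nxt
    return result


def subset_links(itemsets):
    """Link each frequent itemset to the larger frequent itemsets that extend it.

    Assumption: an itemset may have several candidate parents here; a real
    topic tree would have to choose one parent per cluster.
    """
    links = defaultdict(list)
    for s in itemsets:
        if len(s) > 1:
            for parent in combinations(sorted(s), len(s) - 1):
                if frozenset(parent) in itemsets:
                    links[frozenset(parent)].append(s)
    return links


docs = [
    "apple banana fruit",
    "apple fruit juice",
    "banana fruit smoothie",
    "python code bug",
    "python code test",
]
sets = frequent_itemsets(docs, min_count=2)
links = subset_links(sets)
for itemset, doc_ids in sorted(sets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), sorted(doc_ids))
```

On this toy input only five of the nine distinct words survive the threshold, which shows in miniature how restricting attention to frequent items shrinks the feature space. Candidate clusters overlap (document 0 falls under both "apple" and "banana"), so any real method still needs a rule for assigning each document to a single cluster.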