Publication | Closed Access
Document clustering using character N-grams
Citations: 38
References: 23
Year: 2005
Unknown Venue
Engineering; Corpus Linguistics; Text Mining; Word Embeddings; Natural Language Processing; Information Retrieval; Data Science; Text Segmentation; Computational Linguistics; N-gram Representation; Window Size; Document Classification; Language Studies; Document Clustering; Knowledge Discovery; Character N-grams; Vector Space Model; Keyword Extraction; Linguistics; Semantic Similarity
We propose a novel method for document clustering using character N-grams. In the traditional vector-space model, documents are represented as vectors in which each dimension corresponds to a word. We propose a document representation based on the most frequent character N-grams, with a window size of up to 10 characters. We derive a new distance measure, which produces uniformly better results when compared to the word-based and term-based methods. This result is all the more significant because the N-gram method is robust and requires no language-dependent preprocessing. Experiments on the performance of a clustering algorithm on a variety of test document corpora demonstrate that the N-gram representation with n=3 outperforms both word and term representations. The comparison between word and term representations depends on the data set and the selected dimensionality.
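The representation described in the abstract can be illustrated with a minimal Python sketch. It builds a profile of the most frequent character N-grams for a document and compares two profiles. The function names `ngram_profile` and `profile_distance`, the `top_k` cutoff, and the relative-difference dissimilarity are assumptions made for illustration only; the abstract does not give the authors' distance formula, so this should not be read as their measure.

```python
from collections import Counter


def ngram_profile(text: str, n: int = 3, top_k: int = 100) -> dict[str, float]:
    """Return relative frequencies of the top_k most frequent character n-grams.

    Hypothetical helper: the cutoff top_k and the normalization by the total
    number of n-grams are illustrative choices, not taken from the paper.
    """
    # Slide a window of n characters over the raw text (no tokenization,
    # stemming, or other language-dependent preprocessing).
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.most_common(top_k)}


def profile_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Sum of squared relative differences over the union of both profiles.

    Assumed stand-in dissimilarity; an n-gram missing from a profile is
    treated as having frequency 0 there.
    """
    distance = 0.0
    for gram in set(p) | set(q):
        a = p.get(gram, 0.0)
        b = q.get(gram, 0.0)
        # a + b > 0 because gram occurs in at least one profile.
        distance += ((a - b) / ((a + b) / 2)) ** 2
    return distance


if __name__ == "__main__":
    doc_a = "document clustering with character n-grams"
    doc_b = "clustering documents using character trigrams"
    doc_c = "stock markets fell sharply on tuesday"
    pa, pb, pc = (ngram_profile(d, n=3) for d in (doc_a, doc_b, doc_c))
    print(profile_distance(pa, pb), profile_distance(pa, pc))
```

A pairwise dissimilarity of this kind could then be passed to any clustering algorithm that accepts a precomputed distance matrix. The default `n=3` mirrors the setting the abstract reports as performing best.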