Document clustering using character N-grams

Abstract

We propose a novel method for document clustering using character N-grams. In the traditional vector-space model, the documents are represented as vectors, in which each dimension corresponds to a word. We propose a document representation based on the most frequent character N-grams, with window size of up to 10 characters. We derive a new distance measure, which produces uniformly better results when compared to the word-based and term-based methods. The result becomes more significant in the light of the robustness of the N-gram method with no language-dependent preprocessing. Experiments on the performance of a clustering algorithm on a variety of test document corpora demonstrate that the N-gram representation with n=3 outperforms both word and term representations. The comparison between word and term representations depends on the data set and the selected dimensionality.

References

Page 1

	Year	Citations

Page 1