Concepedia

TLDR

Finnish’s highly inflectional and agglutinative morphology suggests that lemmatization, which includes splitting compound words, may be a more suitable normalization than simple stemming. The study compared stemming and lemmatization in clustering Finnish text documents. Document relevance was assessed on a four‑point scale that was collapsed to a binary one, and four hierarchical clustering methods were applied. Lemmatization outperformed stemming: single and complete linkage recovered highly relevant documents better with it, and average linkage and Ward’s method achieved higher precision.

Abstract

Stemming and lemmatization were compared in the clustering of Finnish text documents. Since Finnish is a highly inflectional and agglutinative language, we hypothesized that lemmatization, which involves splitting compound words, would be a more appropriate normalization approach than straightforward stemming. The relevance of the documents was evaluated on a four-point relevance assessment scale, which was collapsed into a binary scale in two ways: by counting either all relevant documents or only the highly relevant documents as relevant. Experiments with four hierarchical clustering methods supported the hypothesis. The stringent relevance scale showed that lemmatization allowed the single and complete linkage methods to recover the highly relevant documents in particular better than stemming did. Compared with stemming, lemmatization combined with the average linkage and Ward's methods produced higher precision. We conclude that lemmatization is a better word normalization method than stemming when Finnish text documents are clustered for information retrieval.
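The collapsing of the four-point scale and the precision measure can be illustrated with a minimal sketch. It is not the study's code: the numeric grades (0 to 3, with 3 meaning highly relevant), the threshold values, the label "liberal" for the all-relevant variant, the document ids, and the helper names `collapse` and `precision` are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set


def collapse(grades: Dict[str, int], threshold: int) -> Set[str]:
    """Collapse graded relevance to binary: keep documents graded >= threshold."""
    return {doc for doc, grade in grades.items() if grade >= threshold}


def precision(cluster: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of the documents in a cluster that are relevant."""
    members = set(cluster)
    if not members:
        return 0.0
    return len(members & relevant) / len(members)


# Hypothetical grades on an assumed 0..3 scale (3 = highly relevant).
grades = {"d1": 3, "d2": 1, "d3": 0, "d4": 2, "d5": 3}

# Assumed thresholds: "liberal" counts every relevant document,
# "stringent" counts only the highly relevant ones.
liberal = collapse(grades, threshold=1)
stringent = collapse(grades, threshold=3)

# A hypothetical cluster produced by some clustering method.
cluster = ["d1", "d2", "d3", "d5"]

liberal_precision = precision(cluster, liberal)
stringent_precision = precision(cluster, stringent)
```

Under these assumed grades the same cluster scores differently on the two binary scales, which is why results on the stringent scale are reported separately in the abstract.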
