
Publication | Closed Access

Integrating Collocation as TF-IDF Enhancement to Improve Classification Accuracy

Citations: 30

References: 17

Year: 2019

Abstract

This study is motivated by a weakness of Term Frequency - Inverse Document Frequency (TF-IDF): it operates on single terms, and single terms can be vague. A single term used for indexing can convey several interpretations. It can also be too general to have any discriminating power. For example, the two individual terms "junior" and "college" are not enough to distinguish "junior college" from "college junior". This study therefore aims to enhance TF-IDF by integrating collocation as a term feature. Collocated terms are extracted by identifying part-of-speech (POS) sequences that form specific patterns, such as adjective + noun, noun + noun, and noun + verb. Three document classifiers were considered and subjected to both the traditional and the modified TF-IDF: RandomForest, MultinomialNB (MultiNB), and SVM. The results of the experiment show that integrating collocation into the TF-IDF process outperforms the traditional TF-IDF, with an increase of up to 10 percent.
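The idea of adding POS-pattern collocations to the TF-IDF feature set can be illustrated with a minimal sketch. The abstract does not state which POS tagger, which full pattern list, or which TF-IDF variant the authors used, so everything below is an assumption: `TOY_POS` is a hypothetical hand-written lexicon standing in for a real tagger, `PATTERNS` covers only the three patterns named in the abstract, and the weighting is the plain relative-frequency times log(N/df) form.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for a trained POS tagger (assumption).
# "junior" is tagged NOUN in both uses to keep the sketch simple.
TOY_POS = {
    "junior": "NOUN", "college": "NOUN",
    "attends": "VERB", "is": "VERB", "a": "DET",
}

# The three patterns named in the abstract; the paper's full list is not given.
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN"), ("NOUN", "VERB")}


def features(text):
    """Return single terms plus adjacent pairs whose tags match a pattern."""
    tokens = text.lower().split()
    tags = [TOY_POS.get(tok, "X") for tok in tokens]
    feats = list(tokens)
    for i in range(len(tokens) - 1):
        if (tags[i], tags[i + 1]) in PATTERNS:
            feats.append(tokens[i] + "_" + tokens[i + 1])
    return feats


def tfidf(docs_feats):
    """Plain TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs_feats)
    df = Counter()
    for feats in docs_feats:
        df.update(set(feats))
    weights = []
    for feats in docs_feats:
        tf = Counter(feats)
        weights.append(
            {t: (c / len(feats)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


docs = ["she attends a junior college", "he is a college junior"]
weights = tfidf([features(d) for d in docs])

# Single terms shared by both documents get zero weight,
# while each collocation is unique to one document.
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 3) for t, v in w.items() if v > 0})
```

In this toy corpus, "junior" and "college" occur in both documents and so carry no weight, whereas `junior_college` and `college_junior` each appear in only one document and can separate them. The resulting weight dictionaries could then be vectorized and passed to any of the classifiers the abstract mentions.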
