Concepedia

Publication | Closed Access

Improving Topic Classification for Highly Inflective Languages

12

Citations

19

References

2012

Year

Abstract

Despite the existence of many effective methods to solve topic classification tasks for such widely used languages as English, there is no clear answer whether these methods are suitable for languages that are substantially different. We attempt to solve a topic classification task for Lithuanian, a relatively resource-scarce language that is highly inflective, has a rich vocabulary, and a complex word derivation system. We show that classification performance is significantly higher when the inflective character of the language is taken into account by using character n-grams as opposed to the more common bag-of-words approach. These results are not only promising for Lithuanian, but also for other languages with similar properties. We show that the performance of classifiers based on character n-grams even surpasses that of classifiers built on stemmed or lemmatized text. This indicates that topic classification is possible even for languages for which automatic grammatical tools are not available. TITLE AND ABSTRACT IN LITHUANIAN Klasifikavimo į temas gerinimas stipriai kaitomoms kalboms Nepaisant to, jog tokioms plačiai naudojamoms kalboms kaip anglų yra sukurta daug efektyvių

References

YearCitations

Page 1