N-grams based features for Indonesian tweets classification problems

Abstract

Twitter is one of the popular microblogging services that allow users to write short messages of up to 140 characters. Active Twitter users in Indonesia reached 29.4 million in 2017, and they have created an enormous number of tweets, a potential data source for supervised learning. In this work, six different sets of n-gram word dictionaries were generated and used as references for creating numerical features of the tweets. We classified the tweets using k-Nearest Neighbors (k-NN) and the Naive Bayes Classifier and compared their performance using the F-measure. We also observed the classification time of each algorithm. The results show that the k-NN algorithm performed better than the Naive Bayes Classifier, reaching an F-measure of 81.2% with k=7. However, in terms of classification time, the Naive Bayes Classifier was faster than k-NN for all values of k.
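The pipeline described in the abstract (n-gram dictionary, numerical features, k-NN classification, F-measure) can be illustrated with a minimal sketch. The abstract does not say how the dictionaries were built or what the feature values were, so the sketch assumes whitespace tokenization, a dictionary made of every word n-gram seen in the training tweets, raw n-gram counts as features, and squared Euclidean distance for k-NN. The toy tweets, the labels "traffic" and "weather", and all function names are hypothetical and are not taken from the paper.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return the word n-grams (as tuples) of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_dictionary(tweets, n):
    """Assumed dictionary: every distinct n-gram seen in the training tweets."""
    return sorted({gram for tweet in tweets for gram in word_ngrams(tweet, n)})

def to_features(tweet, dictionary, n):
    """Assumed features: how often each dictionary n-gram occurs in the tweet."""
    counts = Counter(word_ngrams(tweet, n))
    return [counts[gram] for gram in dictionary]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training vectors closest to x."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: squared_distance(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def f_measure(y_true, y_pred, positive):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy data, for illustration only.
train_tweets = [
    "macet parah di jalan sudirman",
    "jalan tol macet total",
    "hujan deras sore ini",
    "cuaca cerah hujan reda",
]
train_labels = ["traffic", "traffic", "weather", "weather"]

N = 1  # unigrams; the paper uses six dictionary variants
dictionary = build_dictionary(train_tweets, N)
train_X = [to_features(t, dictionary, N) for t in train_tweets]

query = to_features("macet di jalan tol", dictionary, N)
predicted = knn_predict(train_X, train_labels, query, k=3)
print(predicted)
```

Because k-NN compares the query against every stored training vector at prediction time, its classification cost grows with the training set, which is consistent with the slower classification times reported for k-NN above.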
