Publication | Closed Access
Improving Arabic Text Categorization Using Transformer Training Diversification
Citations: 27
References: 27
Year: 2020
Unknown Venue
Engineering, Social Media Posts, Language Processing, Journalism, Text Mining, Natural Language Processing, Classification Method, Arabic, Computational Linguistics, Document Classification, News Headlines, News Recommendation, Language Studies, Automatic Categorization, News Semantics, Content Analysis, Social Medium Mining, Machine Translation, Automatic Classification, Knowledge Discovery, Social Medium Data, Linguistics
Automatic categorization of short texts, such as news headlines and social media posts, has many applications, ranging from content analysis to recommendation systems. In this paper, we use text categorization, i.e., labeling social media posts with categories such as ‘sports’, ‘politics’, and ‘human-rights’, to showcase the efficacy of models across different sources and varieties of Arabic. We show that diversifying the training data, whether by using diverse training data for the specific task (an increase of 21% in macro F1) or by using diverse data to pre-train a BERT model (26% in macro F1), leads to overall improvements in classification effectiveness. We also introduce two new Arabic text categorization datasets: the first is composed of social media posts from a popular Arabic news channel, covering Twitter, Facebook, and YouTube, and the second is composed of tweets from popular Arabic accounts. The posts in the former are almost exclusively written in Modern Standard Arabic (MSA), while the tweets in the latter contain both MSA and dialectal Arabic.
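The improvements in the abstract are reported in macro F1, which averages per-class F1 scores without weighting by class frequency, so rare categories count as much as common ones. The following is a minimal sketch of that metric in plain Python. It is not the authors' evaluation code: the function name `macro_f1` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0

    # Count true positives, false positives, and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    # F1 = 2*TP / (2*TP + FP + FN), computed independently for each class.
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)

    return sum(f1_scores) / len(f1_scores)


# Hypothetical gold and predicted categories for four posts (illustrative only).
gold = ["sports", "politics", "politics", "human-rights"]
pred = ["sports", "politics", "sports", "human-rights"]
score = macro_f1(gold, pred)
```

In this toy example, 'sports' and 'politics' each reach an F1 of 2/3 and 'human-rights' reaches 1, so the macro average is 7/9, or about 0.78.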