Publication | Closed Access
Automatic identification of Arabic dialects in social media
Citations: 58 | References: 20 | Year: 2014 | Venue: unknown
Arabic Dialect Linguistics, Naive Bayes Classifiers, Engineering, Media Arabic, Corpus Linguistics, Text Mining, Natural Language Processing, Social Media, Information Retrieval, Arabic, Computational Linguistics, Document Classification, Language Studies, Content Analysis, Machine Translation, Social Medium Mining, Naive Bayes Classifier, Modern Standard Arabic, Arabic Dialect Morphological Analysis, Language Recognition, Social Medium Data, Text Processing, Linguistics
Modern Standard Arabic is the formal language used across Arabic countries, while Arabic dialects differ markedly, especially in social media where texts often mix MSA and dialect forms. The study proposes a framework to classify Arabic dialects in social media by bridging MSA and AD using probabilistic models. The authors conduct experiments with character n‑gram Markov language models and Naive Bayes classifiers to evaluate model performance under varying social media conditions. The Naive Bayes classifier using character bi‑grams identifies 18 Arabic dialects with 98% accuracy, marking a first step toward an Arabic‑to‑English and French translation system under the ASMAT project.
Modern Standard Arabic (MSA) is the formal language in most Arabic countries. Arabic dialects (AD), the language of daily use, differ from MSA, especially in social media communication. Most Arabic social media texts, however, mix forms and show many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for AD classification using probabilistic models across social media datasets. We present a set of experiments using the character n-gram Markov language model and Naive Bayes classifiers, with a detailed examination of which models perform best under different conditions in a social media context. Experimental results show that a Naive Bayes classifier based on a character bi-gram model can identify the 18 different Arabic dialects with an overall accuracy of 98%. This work is a first step toward the ultimate goal of a translation system from Arabic to English and French, within the ASMAT project.
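To make the character bi-gram Naive Bayes idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a multinomial Naive Bayes with add-one smoothing over overlapping character bi-grams; the class name `CharNgramNaiveBayes`, the helper `char_ngrams`, the labels `EGY`, `LEV`, `GLF`, and the three toy sentences are all hypothetical and are not taken from the paper's datasets.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text, padded with spaces."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative sketch)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n                      # n-gram order (2 = bi-grams)
        self.alpha = alpha              # additive smoothing constant (assumed)
        self.ngram_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(gram | label) for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        scores = {}
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                # Unseen n-grams get only the smoothing mass alpha.
                score += math.log(
                    (counts[gram] + self.alpha) / (total + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data: made-up greetings, one per hypothetical dialect label.
train_texts = ["إزيك عامل إيه", "كيفك شو أخبارك", "شلونك شخبارك"]
train_labels = ["EGY", "LEV", "GLF"]

model = CharNgramNaiveBayes(n=2).fit(train_texts, train_labels)
prediction = model.predict("إزيك")
print(prediction)
```

Character n-grams need no tokenizer or morphological analyzer, which is one plausible reason they suit noisy social media text that mixes MSA and dialect forms. A realistic experiment would need a labeled corpus per dialect and a held-out test set; three sentences only show the mechanics.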