NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

TLDR

Sentiment analysis is widely studied in NLP, yet most work focuses on high-resource languages. The study introduces the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), comprising about 30,000 annotated tweets per language (14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. The authors propose methods for collecting, filtering, processing, and labeling tweets in these low-resource languages, then evaluate a range of pre-trained models and transfer strategies on the resulting dataset. They find that language-specific models and language-adaptive fine-tuning generally perform best, and they release the datasets, trained models, sentiment lexicons, and code to encourage further research on under-represented languages.

Abstract

Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.