RoBERT – A Romanian BERT Model

TLDR

Deep pre-trained language models are ubiquitous in NLP: they learn contextualized representations from large unlabeled corpora and achieve state-of-the-art results. For languages other than English, however, options are limited, and most available models are trained only on multilingual corpora. This work introduces RoBERT, a Romanian-only pre-trained BERT model, and compares it with multilingual models on seven Romanian NLP tasks in three categories: sentiment analysis, dialect and cross-dialect topic identification, and diacritics restoration. RoBERT outperforms the multilingual models and another monolingual BERT implementation on all evaluated tasks.

Abstract

Deep pre-trained language models are becoming ubiquitous in the field of Natural Language Processing (NLP). These models learn contextualized representations from huge amounts of unlabeled text and obtain state-of-the-art results on a multitude of NLP tasks by enabling efficient transfer learning. For languages other than English, few such models are available, and most of them are trained only on multilingual corpora. In this paper we introduce RoBERT, a Romanian-only pre-trained BERT model, and compare it with different multilingual models on seven Romanian-specific NLP tasks grouped into three categories: sentiment analysis, dialect and cross-dialect topic identification, and diacritics restoration. Our model surpasses the multilingual models, as well as another monolingual implementation of BERT, on all tasks.
