Publication | Open Access
BERTje: A Dutch BERT Model
214 Citations · 9 References · Year: 2019
Keywords: Engineering, Multilingual Pretraining, Semantics, Large Language Model, Sentiment Analysis, Language Processing, Text Mining, Speech Recognition, Applied Linguistics, Natural Language Processing, Information Retrieval, Computational Linguistics, Multilingual BERT Model, Language Studies, Language Models, Machine Translation, Dutch BERT Model, Natural Language, NLP Task, Diverse Dataset, Pre-trained Models, Cross-lingual Natural Language Processing, Linguistics, POS Tagging
BERT, a transformer‑based pre‑trained language model, has advanced state‑of‑the‑art performance on many NLP tasks, yet multilingual BERT’s Dutch coverage is limited to Wikipedia text. The authors develop and evaluate BERTje, a monolingual Dutch BERT model trained on a large, diverse dataset of 2.4 billion tokens using the same architecture and hyperparameters as the original BERT. BERTje consistently outperforms the equally‑sized multilingual BERT model on downstream NLP tasks such as part‑of‑speech tagging, named‑entity recognition, semantic role labeling, and sentiment analysis, and the pre‑trained model is publicly available on GitHub.
The transformer-based pre-trained language model BERT has helped to improve state-of-the-art performance on many natural language processing (NLP) tasks. Using the same architecture and parameters, we developed and evaluated a monolingual Dutch BERT model called BERTje. Compared to the multilingual BERT model, which includes Dutch but is only based on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. BERTje consistently outperforms the equally-sized multilingual BERT model on downstream NLP tasks (part-of-speech tagging, named-entity recognition, semantic role labeling, and sentiment analysis). Our pre-trained Dutch BERT model is made available at https://github.com/wietsedv/bertje.
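As a minimal sketch, the released model can be loaded through the Hugging Face transformers library. The Hub identifier GroNLP/bert-base-dutch-cased is taken from the project's GitHub README; treat it as an assumption and check the repository for the current name.

```python
# Minimal sketch: load BERTje and extract contextual embeddings for a Dutch sentence.
# Assumes the model is published under the Hub id "GroNLP/bert-base-dutch-cased"
# (the identifier listed in the project's README); adjust if the repository differs.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")

# Tokenize a Dutch sentence and run it through the encoder.
inputs = tokenizer("Ik hou van taalmodellen.", return_tensors="pt")
outputs = model(**inputs)

# Per-token contextual embeddings, one 768-dimensional vector per subword token.
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size=768)
```

These embeddings are the typical starting point for fine-tuning on the downstream tasks reported in the paper (POS tagging, NER, semantic role labeling, sentiment analysis).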
[Chart: citations by year]