Publication | Open Access
BERTje: A Dutch BERT Model
214 Citations · 9 References · Year: 2019
Keywords: Engineering, Multilingual Pretraining, Semantics, Large Language Model, Sentiment Analysis, Language Processing, Text Mining, Speech Recognition, Applied Linguistics, Natural Language Processing, Information Retrieval, Computational Linguistics, Multilingual BERT Model, Language Studies, Language Models, Machine Translation, Dutch BERT Model, Natural Language, NLP Task, Diverse Dataset, Pre-trained Models, Cross-lingual Natural Language Processing, Linguistics, POS Tagging
BERT, a transformer‑based pre‑trained language model, has advanced state‑of‑the‑art performance on many NLP tasks, yet multilingual BERT’s Dutch coverage is limited to Wikipedia text. The authors develop and evaluate BERTje, a monolingual Dutch BERT model trained on a large, diverse dataset of 2.4 billion tokens using the same architecture and hyperparameters as the original BERT. BERTje consistently outperforms the equally‑sized multilingual BERT model on downstream NLP tasks such as part‑of‑speech tagging, named‑entity recognition, semantic role labeling, and sentiment analysis, and the pre‑trained model is publicly available on GitHub.
The transformer-based pre-trained language model BERT has helped to improve state-of-the-art performance on many natural language processing (NLP) tasks. Using the same architecture and parameters, we developed and evaluated a monolingual Dutch BERT model called BERTje. Compared to the multilingual BERT model, which includes Dutch but is only based on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. BERTje consistently outperforms the equally-sized multilingual BERT model on downstream NLP tasks (part-of-speech tagging, named-entity recognition, semantic role labeling, and sentiment analysis). Our pre-trained Dutch BERT model is made available at https://github.com/wietsedv/bertje.
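As a minimal sketch, the released model can be loaded through the Hugging Face transformers library. The Hub identifier GroNLP/bert-base-dutch-cased is taken from the project's GitHub README; treat it as an assumption and check the repository for the current name.

```python
# Minimal sketch: load BERTje and extract contextual embeddings for a Dutch sentence.
# Assumes the model is published under the Hub id "GroNLP/bert-base-dutch-cased"
# (the identifier listed in the project's README); adjust if the repository differs.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")

# Tokenize a Dutch sentence and run it through the encoder.
inputs = tokenizer("Ik hou van taalmodellen.", return_tensors="pt")
outputs = model(**inputs)

# Per-token contextual embeddings, one 768-dimensional vector per subword token.
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size=768)
```

These embeddings are the typical starting point for fine-tuning on the downstream tasks reported in the paper (POS tagging, NER, semantic role labeling, sentiment analysis).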
[Chart: citations by year]