Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

TLDR

Large-scale pre-training and multilingual modeling have led to a proliferation of web-mined text datasets covering hundreds of languages. The authors manually audited 205 language-specific corpora from five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4) and supplemented the audit with automatic analyses. The audit found systematic issues in lower-resource corpora: at least 15 contain no usable text, a significant fraction contain less than 50% sentences of acceptable quality, and many are mislabeled or use nonstandard or ambiguous language codes. The study recommends techniques for evaluating and improving multilingual corpora and discusses the risks of low-quality data releases.

Abstract

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
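The two headline statistics (corpora with no usable text, and corpora with less than 50% acceptable sentences) can be illustrated with a minimal sketch. It assumes each audited sentence has been reduced to a single acceptable/unacceptable judgment. The corpus names, the records, and the `acceptable_fraction` helper are hypothetical and are not taken from the paper's annotation scheme or data.

```python
from collections import defaultdict

# Hypothetical audit records: (corpus id, was the sampled sentence acceptable?).
# These values are illustrative only, not results from the paper.
annotations = [
    ("corpus_a", True), ("corpus_a", True), ("corpus_a", False),
    ("corpus_b", False), ("corpus_b", False), ("corpus_b", True),
    ("corpus_c", False), ("corpus_c", False),
]

def acceptable_fraction(records):
    """Return {corpus: share of sampled sentences judged acceptable}."""
    counts = defaultdict(lambda: [0, 0])  # corpus -> [acceptable, total]
    for corpus, ok in records:
        counts[corpus][0] += int(ok)
        counts[corpus][1] += 1
    return {c: good / total for c, (good, total) in counts.items()}

fractions = acceptable_fraction(annotations)

# Corpora where under half of the sampled sentences were acceptable.
below_half = sorted(c for c, f in fractions.items() if f < 0.5)

# Corpora where no sampled sentence was acceptable.
no_usable = sorted(c for c, f in fractions.items() if f == 0.0)
```

Because an audit of this kind inspects only a sample per corpus, such fractions are estimates of corpus quality rather than exact measurements.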
