Publication | Open Access
Learning Word Vectors for 157 Languages
Citations: 143
References: 0
Year: 2018
Engineering, Cross-lingual Representation, Semantics, Large Language Model, Word Vectors, Corpus Linguistics, Text Mining, Word Embeddings, Applied Linguistics, Natural Language Processing, Language Documentation, Information Retrieval, Data Science, Common Crawl Project, Computational Linguistics, Language Studies, Word Representations, Machine Translation, NLP Task, Linguistics, Pre-trained Models, Lexicon
Word vectors trained on large corpora underpin state-of-the-art results on many NLP tasks. The study trains high-quality word representations for 157 languages using Wikipedia and Common Crawl data, and introduces new word analogy datasets for French, Hindi, and Polish. The resulting vectors are evaluated on the 10 languages for which evaluation datasets exist, where they perform very strongly compared with previous models.
Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
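The abstract mentions word analogy datasets without describing how an analogy question is scored. As a minimal sketch, assuming the widely used vector-offset approach (often called 3CosAdd) rather than the authors' exact procedure, an analogy "a is to b as c is to ?" can be answered by finding the word whose vector is closest in cosine similarity to b - a + c. The names `TOY_VECTORS`, `solve_analogy`, and `analogy_accuracy` are hypothetical, and the 3-dimensional vectors are invented for illustration; they are not taken from the released models.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for illustration only.
# Real pre-trained word vectors have far more dimensions.
TOY_VECTORS = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man": np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([-0.8, 0.3, 0.2]),  # unrelated distractor word
}


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    Assumption: candidates are ranked by cosine similarity to b - a + c,
    and the three question words themselves are excluded from the search.
    """
    target = normalize(
        normalize(vectors[b]) - normalize(vectors[a]) + normalize(vectors[c])
    )
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = float(np.dot(normalize(vec), target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(
        1 for a, b, c, expected in questions
        if solve_analogy(vectors, a, b, c) == expected
    )
    return correct / len(questions)


if __name__ == "__main__":
    questions = [
        ("man", "king", "woman", "queen"),
        ("king", "man", "queen", "woman"),
    ]
    print(solve_analogy(TOY_VECTORS, "man", "king", "woman"))
    print(analogy_accuracy(TOY_VECTORS, questions))
```

An analogy dataset is then a list of such four-word questions, and the reported score is the fraction answered correctly. A real evaluation would also need a policy for questions containing out-of-vocabulary words, which this sketch omits.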