Publication | Open Access
WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models
Citations: 64
References: 18
Year: 2021
LLM Fine-tuning, Engineering, Machine Learning, Multilingual Pretraining, Large Language Model, Corpus Linguistics, Speech Recognition, Natural Language Processing, Large Language Models, Data Science, Computational Linguistics, Language Studies, Machine Translation, Large AI Model, GPT3 Model, Pre-trained Models, Computer Science, Deep Learning, Chinese Characters, Pre-training Language Models, Pre-trained Language Model, Linguistics
Using large-scale training data to build a pre-trained language model (PLM) with a larger number of parameters can significantly improve performance on downstream tasks. For example, OpenAI trained the GPT3 model, with 175 billion parameters, on 570 GB of English training data, enabling downstream applications to be built with only a small number of samples. However, there is a lack of Chinese corpora large enough to support large-scale PLMs. This paper introduces WuDaoCorpora, a super large-scale Chinese corpus containing about 3 TB of training data and 1.08 trillion Chinese characters. We also release the base version of WuDaoCorpora, containing about 200 GB of training data and 72 billion Chinese characters. As a baseline, we train a Transformer-XL model with 3 billion parameters on the base version to test the corpus's effect. The results show that models trained on this corpus can achieve excellent performance in Chinese. The data and model are available at https://data.wudaoai.cn and https://github.com/THUDM/Chinese-Transformer-XL, respectively.