Publication | Open Access
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Citations: 428
References: 28
Year: 2019
LLM Fine-tuning, Engineering, Machine Learning, Education, Well-read Students, Multilingual Pretraining, Learning-by-doing, Large Language Model, Pre-trained Distillation, Language Learning, Corpus Linguistics, Pre-training Compact Models, Natural Language Processing, Pre-training, Natural Language Representations, Data Science, Computational Linguistics, Language Acquisition, Language Studies, Just-in-time Learning, Standard Knowledge Distillation, Machine Translation, Large AI Model, Learning Sciences, Pre-trained Models, Deep Learning, Knowledge Distillation, Learning Theory, Adaptive Learning, Technology-enhanced Active Learning, Linguistics
Large pre‑trained language models are costly, prompting compression methods, yet the simple approach of pre‑training and fine‑tuning compact models has been largely ignored. The authors show that pre‑training compact models is essential and that fine‑tuning them can rival more complex methods, and they release 24 pre‑trained miniature BERT models to support future research. They begin with pre‑trained compact models and apply knowledge distillation from large fine‑tuned models, then conduct extensive experiments to study how pre‑training and distillation interact across model size and unlabeled data characteristics. The study finds that pre‑training compact models is crucial, fine‑tuning them rivals advanced methods, the Pre‑trained Distillation algorithm yields further gains, and pre‑training and distillation compound even when applied sequentially on the same data.
Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to downstream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and that fine-tuning pre-trained compact models can be competitive with more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that pre-training and distillation have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.
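The distillation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation of standard knowledge distillation, in which the compact student is trained to match the teacher's soft label distribution through a cross-entropy loss. The function names and the `temperature` parameter are illustrative assumptions and are not taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_sum_exp for z in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between teacher and student label distributions.

    Assumption: this soft-label objective stands in for "standard
    knowledge distillation"; the paper's exact loss may differ.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


if __name__ == "__main__":
    teacher = [2.0, 0.0]          # hypothetical logits from a large fine-tuned teacher
    agreeing_student = [2.0, 0.0]
    disagreeing_student = [0.0, 2.0]
    print(distillation_loss(teacher, agreeing_student))
    print(distillation_loss(teacher, disagreeing_student))
```

Under this assumption the loss is smallest when the student reproduces the teacher's distribution, so minimizing it over unlabeled task data pulls a pre-trained compact model toward the teacher's predictions.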