Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

TLDR

Large pre-trained language models are costly to apply, which has prompted a range of compression methods, yet the simple baseline of pre-training and fine-tuning compact models has been largely overlooked. The authors show that pre-training remains important for smaller architectures and that fine-tuning pre-trained compact models can be competitive with more elaborate methods. Starting from pre-trained compact models, they apply standard knowledge distillation from large fine-tuned models; the resulting algorithm, Pre-trained Distillation, brings further improvements. Extensive experiments study how pre-training and distillation interact across model size and properties of unlabeled task data, and find that the two have a compound effect even when applied sequentially on the same data. The authors release 24 pre-trained miniature BERT models to support future research.

Abstract

Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to downstream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and that fine-tuning pre-trained compact models can be competitive with more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that pre-training and distillation have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.
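
The distillation step mentioned in the abstract can be illustrated with a small sketch. It assumes, as one common reading of "standard knowledge distillation", that the student is trained to minimize the cross-entropy between the teacher's soft label distribution and its own predictions on unlabeled examples. The function names are hypothetical, no temperature scaling is applied, and nothing here is taken from the authors' implementation.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    """Log-probabilities computed via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def distillation_loss(teacher_logits, student_logits):
    """Cross-entropy between teacher soft labels and student predictions.

    Assumption: the teacher is a large fine-tuned model and the student
    is a pre-trained compact model; both emit logits over the same classes.
    """
    p_teacher = softmax(teacher_logits)
    log_p_student = log_softmax(student_logits)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


def batch_distillation_loss(teacher_batch, student_batch):
    """Average the per-example loss over a batch of unlabeled examples."""
    losses = [
        distillation_loss(t, s) for t, s in zip(teacher_batch, student_batch)
    ]
    return sum(losses) / len(losses)
```

The loss is smallest when the student reproduces the teacher's distribution exactly, in which case it equals the entropy of the teacher's soft labels. In practice the logits would come from the two models, and the student's parameters would be updated by gradient descent on this quantity.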
