Muppet: Massive Multi-task Representations with Pre-Finetuning

TLDR

The authors introduce pre-finetuning, an additional large-scale learning stage between language-model pre-training and fine-tuning. It consists of massively multi-task learning on around 50 datasets totaling over 4.8 million labeled examples, and is designed to produce representations that generalize better across tasks. Pre-finetuning consistently improves performance for pretrained discriminators such as RoBERTa and generation models such as BART on a wide range of tasks, and significantly improves sample efficiency during fine-tuning. Scale is crucial: with few tasks, pre-finetuning can hurt performance, and only past a critical point (usually above 15 tasks) does performance improve linearly in the number of tasks.

Abstract

We propose pre-finetuning, an additional large-scale learning stage between language model pre-training and fine-tuning. Pre-finetuning is massively multi-task learning (around 50 datasets, over 4.8 million total labeled examples), and is designed to encourage learning of representations that generalize better to many different tasks. We show that pre-finetuning consistently improves performance for pretrained discriminators (e.g., RoBERTa) and generation models (e.g., BART) on a wide range of tasks (sentence prediction, commonsense reasoning, machine reading comprehension (MRC), etc.), while also significantly improving sample efficiency during fine-tuning. We also show that large-scale multi-tasking is crucial: pre-finetuning can hurt performance when few tasks are used, up until a critical point (usually above 15 tasks), after which performance improves linearly in the number of tasks.
