Publication | Open Access
R-Drop: Regularized Dropout for Neural Networks
Citations: 306
References: 58
Year: 2021
Keywords: Natural Language Processing, Deep Neural Networks, Engineering, Machine Learning, Sparse Neural Network, Large Language Model, Pre-trained Models, Computer Science, Neural Networks, Multilingual Pretraining, Complements Dropout, Deep Learning, Model Compression, Machine Translation, Simple Regularization Strategy
Dropout is a powerful and widely used technique for regularizing the training of deep neural networks. In this paper, we introduce a simple regularization strategy built upon dropout in model training, namely R-Drop, which forces the output distributions of different sub-models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub-models sampled by dropout. Theoretical analysis reveals that R-Drop reduces the freedom of the model parameters and complements dropout. Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performance with the vanilla Transformer model on WMT14 English→German translation (30.91 BLEU) and WMT14 English→French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of Transformer models. Our code is available at https://github.com/dropreg/R-Drop.
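The abstract describes the core of R-Drop: each training example is passed through the network twice so that dropout samples two different sub-models, and a symmetric (bidirectional) KL term ties their output distributions together on top of the usual task loss. Below is a minimal PyTorch sketch of that objective for a classification setting; the function name `r_drop_loss`, the weight `alpha`, and the toy model are illustrative assumptions rather than the authors' reference implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def r_drop_loss(model, x, labels, alpha=1.0):
    """Sketch of an R-Drop-style objective: cross-entropy plus a
    bidirectional KL term between two dropout-sampled forward passes."""
    logits1 = model(x)  # first forward pass (dropout mask 1)
    logits2 = model(x)  # second forward pass (dropout mask 2)

    # Standard task loss, averaged over the two passes.
    ce = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))

    # Bidirectional KL-divergence between the two output distributions.
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(logp1, logp2, log_target=True, reduction="batchmean")
        + F.kl_div(logp2, logp1, log_target=True, reduction="batchmean")
    )
    return ce + alpha * kl

# Usage sketch with a toy classifier; dropout must be active (train mode)
# so that the two forward passes see different sub-models.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3), torch.nn.Linear(32, 4),
)
model.train()
x = torch.randn(8, 16)
labels = torch.randint(0, 4, (8,))
loss = r_drop_loss(model, x, labels, alpha=1.0)
loss.backward()
```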
| Year | Citations |
|---|---|
| 2017 | 75.5K |
| 2016 | 30.2K |
| 2021 | 27.9K |
| 2015 | 24.2K |
| 2015 | 18.4K |
| 2019 | 17.1K |
| 2015 | 13.9K |
| 2012 | 6.6K |
| 2015 | 4.1K |
| 2020 | 3K |