Knowledge distillation: A good teacher is patient and consistent

Abstract

There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications. In this paper we address this issue and significantly narrow the gap between these two types of models. Throughout our empirical investigation we do not necessarily aim to propose a new method, but strive to identify a robust and effective recipe for making state-of-the-art large-scale models affordable in practice. We demonstrate that, when performed correctly, knowledge distillation can be a powerful tool for reducing the size of large models without compromising their performance. In particular, we find that certain implicit design choices may drastically affect the effectiveness of distillation. Our key contribution is the explicit identification of these design choices, which were not previously articulated in the literature. We back up our findings with a comprehensive empirical study, demonstrate compelling results on a wide range of vision datasets and, in particular, obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
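
The abstract does not spell out the distillation objective. As a minimal sketch, assuming the common formulation in which the student is trained to match the teacher's predicted class distribution by minimizing a KL divergence, the per-example loss could look like the following. The function names `log_softmax` and `distillation_loss`, and the `temperature` parameter, are illustrative choices and are not taken from the text above.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits.

    Dividing by `temperature` softens the distribution (assumed, illustrative knob).
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    # log-sum-exp with the maximum subtracted to avoid overflow
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two softened class distributions.

    Hypothetical helper: a sketch of one common distillation objective,
    not necessarily the exact loss used in the paper.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must predict the same classes")
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    teacher = [4.0, 1.0, -2.0]
    student = [2.5, 1.5, 0.0]
    print("loss:", distillation_loss(teacher, student))
    print("loss when student matches teacher:", distillation_loss(teacher, teacher))
```

In this sketch the loss is approximately zero when the student reproduces the teacher's logits and positive otherwise. In a real training loop, both networks would be evaluated on the same input, and the loss would be averaged over a batch and minimized with respect to the student's parameters only.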
