Concepedia

Publication | Open Access

Understanding deep learning requires rethinking generalization

Citations: 1.1K | References: 20 | Year: 2016

TLDR

Deep neural networks, despite their large size, often exhibit a very small difference between training and test performance, a phenomenon traditionally attributed to properties of the model family or to regularization during training. The study shows that these conventional explanations fail to account for the strong generalization observed in large neural networks. To test them, the authors performed extensive systematic experiments on modern convolutional networks trained with stochastic gradient methods. The experiments revealed that these networks can perfectly fit random labels or completely random noise even under explicit regularization, and a theoretical construction demonstrates that depth-two networks with more parameters than data points possess perfect finite-sample expressivity; together, these results imply that neither limited model capacity nor explicit regularization explains why such networks generalize.
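
The randomization test at the core of these experiments is easy to reproduce in miniature. The sketch below is ours, not the authors' code: it assumes PyTorch and torchvision are available, and the small CNN and hyperparameters are illustrative choices. It replaces every CIFAR-10 training label with a uniformly random class and trains with plain SGD and no explicit regularization; training accuracy nonetheless climbs toward 100%.

```python
# Minimal randomization-test sketch (assumes PyTorch + torchvision;
# the architecture and hyperparameters are illustrative, not the authors').
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

torch.manual_seed(0)

# CIFAR-10 with every label replaced by a uniformly random class:
# any training accuracy above the 10% chance level is pure memorization.
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=transforms.ToTensor())
train_set.targets = torch.randint(0, 10, (len(train_set),)).tolist()
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# A small CNN, overparameterized relative to the 50,000 training examples.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(128 * 8 * 8, 512), nn.ReLU(),
    nn.Linear(512, 10))

# Plain SGD: no weight decay, no dropout, no data augmentation.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):
    correct, total = 0, 0
    for x, y in loader:
        opt.zero_grad()
        out = model(x)
        loss = loss_fn(out, y)
        loss.backward()
        opt.step()
        correct += (out.argmax(1) == y).sum().item()
        total += y.size(0)
    print(f"epoch {epoch}: train accuracy on random labels = {correct / total:.3f}")
```

On the true labels the same pipeline generalizes reasonably; on random labels test accuracy stays at chance, so the gap between training and test error is determined by the data, not by the architecture or the regularizer.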

Abstract

Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
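
The "perfect finite sample expressivity" claim corresponds to the paper's Theorem 1: a depth-two ReLU network with 2n + d weights can represent any function on a sample of size n in d dimensions. A paraphrased sketch of the construction (our summary in LaTeX, not a verbatim excerpt):

```latex
% Paraphrase of Theorem 1: depth-two ReLU interpolation of any n points.
Given a sample $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$,
consider the width-$n$ network
\[
  c(x) = \sum_{j=1}^{n} w_j \max\bigl(\langle a, x \rangle - b_j,\, 0\bigr),
  \qquad w, b \in \mathbb{R}^n,\ a \in \mathbb{R}^d,
\]
which has $2n + d$ parameters in total. Choose $a$ so that the
projections $z_i = \langle a, x_i \rangle$ are all distinct, and
interleave the biases: $b_1 < z_1 < b_2 < z_2 < \dots < b_n < z_n$.
The interpolation constraints
\[
  y_i = \sum_{j \le i} w_j \,(z_i - b_j), \qquad i = 1, \dots, n,
\]
then form a lower-triangular linear system with nonzero diagonal
entries, so a solution $w$ exists and $c(x_i) = y_i$ for every $i$.
```

Since the number of parameters in practical networks typically exceeds the number of training points by a wide margin, expressivity alone places no barrier to memorizing the training set, random labels included.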
