Concepedia

Publication | Open Access

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Citations: 24.2K

References: 22

Year: 2015

TLDR

Training deep neural networks is hampered by shifting layer input distributions, which slows learning and demands careful initialization, especially with saturating nonlinearities. The authors attribute this to internal covariate shift and mitigate it by normalizing layer inputs. They incorporate batch-wise normalization directly into the network architecture, normalizing activations over each mini-batch during training. Batch Normalization enables higher learning rates, reduces the need for careful initialization, acts as a regularizer, matches the original model's accuracy with 14× fewer training steps and then surpasses it, and with an ensemble of batch-normalized networks reaches 4.9% top-5 validation error on ImageNet, exceeding the accuracy of human raters.

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
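The transformation the abstract describes is straightforward to state in code. Below is a minimal NumPy sketch of batch-wise normalization for a fully connected layer, not the authors' implementation: each feature is normalized using the mini-batch mean and variance, then rescaled and shifted by learned parameters (gamma, beta), while running averages approximate the population statistics used at inference. Function names, the momentum value, and the example data are illustrative assumptions.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var,
                     momentum=0.9, eps=1e-5):
    """Normalize a mini-batch of activations x with shape (batch, features).

    Returns the transformed activations plus updated running statistics
    that an inference pass can use in place of mini-batch statistics.
    """
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    y = gamma * x_hat + beta                 # learned scale and shift restore expressiveness
    # Exponential moving averages estimate population statistics for inference
    # (momentum is an assumed hyperparameter, not a value from the paper).
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return y, running_mean, running_var

def batch_norm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """At test time, normalize with the accumulated population estimates."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# Example: one training step, then inference on new data.
rng = np.random.default_rng(0)
features = 4
gamma, beta = np.ones(features), np.zeros(features)
run_mean, run_var = np.zeros(features), np.ones(features)

batch = rng.normal(loc=3.0, scale=2.0, size=(32, features))
y, run_mean, run_var = batch_norm_train(batch, gamma, beta, run_mean, run_var)
print(y.mean(axis=0), y.std(axis=0))   # approximately 0 and 1 per feature

test_batch = rng.normal(loc=3.0, scale=2.0, size=(8, features))
print(batch_norm_infer(test_batch, gamma, beta, run_mean, run_var)[0])
```

The learned gamma and beta matter because plain normalization alone would constrain what each layer can represent; scaling and shifting let the network recover the identity transformation if that is what training favors, which is the design choice the abstract alludes to when it says normalization is made "a part of the model architecture."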

