Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Citations: 2.6K | References: 26 | Year: 2017

Abstract

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.
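A minimal sketch of the two techniques the abstract names, the linear scaling rule and gradual warmup, assuming the reference settings commonly used for ImageNet/ResNet-50 (base learning rate 0.1 at minibatch size 256, 5 warmup epochs). The function names and the iters_per_epoch value are illustrative assumptions, not drawn from the paper's Caffe2 code:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: when the minibatch size is multiplied by k,
    multiply the learning rate by k (e.g., 8192/256 = 32x)."""
    return base_lr * batch_size / base_batch

def warmup_lr(epoch, iteration, iters_per_epoch, target_lr, start_lr,
              warmup_epochs=5):
    """Gradual warmup: ramp the learning rate linearly from start_lr to
    target_lr over the first warmup_epochs, then hold target_lr."""
    total_warmup_iters = warmup_epochs * iters_per_epoch
    step = epoch * iters_per_epoch + iteration
    if step >= total_warmup_iters:
        return target_lr
    alpha = step / total_warmup_iters
    return start_lr + alpha * (target_lr - start_lr)

# Example: minibatch 8192 gives target_lr = 0.1 * 8192/256 = 3.2,
# approached gradually from the small-minibatch rate of 0.1.
# iters_per_epoch=160 is a hypothetical value for illustration.
target = scaled_lr(8192)  # 3.2
lr = warmup_lr(epoch=2, iteration=100, iters_per_epoch=160,
               target_lr=target, start_lr=scaled_lr(256))
```

The warmup avoids the instability of applying the full scaled rate while the network is far from a good region of the loss surface early in training; after warmup, the schedule proceeds at the scaled rate.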

