A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks

Abstract

The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed $\alpha$-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a Lévy motion. Such SDEs can incur 'jumps', which force the SDE to transition from narrow minima to wider minima, as proven by existing metastability theory. To validate the $\alpha$-stable assumption, we conduct extensive experiments on common deep learning architectures and show that in all settings, the GN is highly non-Gaussian and admits heavy tails. We further investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
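To make the notion of a tail index concrete, the following is a minimal sketch on synthetic data. The abstract does not state which estimator the experiments use, so the sketch assumes a log-moment block estimator for symmetric $\alpha$-stable samples: if $Y$ is the sum of $K_1$ i.i.d. symmetric $\alpha$-stable variables $X$, then $Y$ has the same distribution as $K_1^{1/\alpha} X$, so $1/\alpha = (\mathbb{E}\log|Y| - \mathbb{E}\log|X|)/\log K_1$. The function names `sample_sas` and `estimate_tail_index`, the block size, and the sample count are hypothetical choices for illustration, and the samples are drawn with the Chambers-Mallows-Stuck method rather than taken from real gradient noise.

```python
import math
import random


def sample_sas(alpha, n, rng):
    """Draw n symmetric alpha-stable samples, 0 < alpha <= 2.

    Uses the Chambers-Mallows-Stuck construction. alpha = 2 gives a
    Gaussian (variance 2); alpha = 1 gives a Cauchy variable.
    """
    samples = []
    for _ in range(n):
        u = rng.uniform(-math.pi / 2, math.pi / 2)  # random angle
        w = rng.expovariate(1.0)                    # unit-rate exponential
        x = (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
             * (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))
        samples.append(x)
    return samples


def estimate_tail_index(xs, k1):
    """Estimate alpha from log-moments of block sums (assumed estimator).

    xs is split into k2 = len(xs) // k1 blocks of length k1; each block
    is summed, and the mean log-magnitude of the sums is compared with
    the mean log-magnitude of the individual samples.
    """
    k2 = len(xs) // k1
    xs = xs[: k1 * k2]  # drop the remainder so blocks are equal-sized
    block_sums = [sum(xs[i * k1:(i + 1) * k1]) for i in range(k2)]
    mean_log_y = sum(math.log(abs(y)) for y in block_sums) / k2
    mean_log_x = sum(math.log(abs(x)) for x in xs) / len(xs)
    inv_alpha = (mean_log_y - mean_log_x) / math.log(k1)
    return 1.0 / inv_alpha


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, illustrative only
    for true_alpha in (1.5, 2.0):
        xs = sample_sas(true_alpha, 100_000, rng)
        print(true_alpha, estimate_tail_index(xs, k1=100))
```

In this sketch an estimate close to 2 corresponds to the Gaussian case, while values clearly below 2 indicate heavy tails. Applying the same estimator to gradient noise would require collecting minibatch gradient deviations from a trained network, which is outside the scope of this illustration.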
