Publication | Open Access
On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes
Citations: 107
References: 0
Year: 2018
Stochastic gradient descent is the method of choice for large-scale optimization of machine learning objective functions. Yet, its performance is highly variable and depends heavily on the choice of the stepsizes. This has motivated a large body of research on adaptive stepsizes. However, there is currently a gap in our theoretical understanding of these methods, especially in the non-convex setting. In this paper, we start closing this gap: we theoretically analyze, in both the convex and non-convex settings, a generalized version of the AdaGrad stepsizes. We give sufficient conditions for these stepsizes to achieve almost sure asymptotic convergence of the gradients to zero, proving the first guarantee for generalized AdaGrad stepsizes in the non-convex setting. Moreover, we show that these stepsizes automatically adapt to the noise level of the stochastic gradients in both the convex and non-convex settings, interpolating between $O(1/T)$ and $O(1/\sqrt{T})$, up to logarithmic terms.
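To make the idea concrete, the following is a minimal sketch of SGD with an AdaGrad-norm-style stepsize, $\eta_t = \alpha / \sqrt{\beta + \sum_{i \le t} \|g_i\|^2}$. This is an illustrative instance of the family of stepsizes the abstract describes, not the paper's exact generalization; the function names, the test objective, and the constants $\alpha$, $\beta$ are assumptions chosen for the example. The stepsize shrinks as squared gradient norms accumulate, which is what lets the method adapt to the noise level without manual tuning.

```python
import numpy as np

def sgd_adagrad_norm(grad, x0, steps=1000, alpha=1.0, beta=1.0, rng=None):
    """SGD with an AdaGrad-norm stepsize (illustrative sketch).

    grad(x, rng) returns a stochastic gradient at x.
    Stepsize: eta_t = alpha / sqrt(beta + sum of past squared gradient norms),
    so larger gradient noise accumulates faster and shrinks the stepsize sooner.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    accum = 0.0  # running sum of ||g_i||^2
    for _ in range(steps):
        g = grad(x, rng)
        accum += float(np.dot(g, g))
        x = x - (alpha / np.sqrt(beta + accum)) * g
    return x

# Example: noisy gradients of f(x) = 0.5 * ||x||^2 (minimum at the origin).
noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
x_final = sgd_adagrad_norm(noisy_grad, x0=np.ones(5))
```

Note that the same stepsize is used for every coordinate (a scalar, "norm" variant); per-coordinate AdaGrad instead keeps one accumulator per dimension.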