Using Curvature Information for Fast Stochastic Search

Abstract

We present an algorithm for fast stochastic gradient descent that uses a nonlinear adaptive momentum scheme to optimize the late-time convergence rate. The algorithm makes effective use of curvature information, requires only O(n) storage and computation, and delivers convergence rates close to the theoretical optimum. We demonstrate the technique on linear and large nonlinear backprop networks.

Improving Stochastic Search

Learning algorithms that perform gradient descent on a cost function can be formulated in either stochastic (on-line) or batch form. The stochastic version takes the form

$$\omega_{t+1} = \omega_t + \mu_t \, G(\omega_t; x_t) \tag{1}$$

where $\omega_t$ is the current weight estimate, $\mu_t$ is the learning rate, $G$ is minus the instantaneous gradient estimate, and $x_t$ is the input at time $t$. One obtains the corresponding batch-mode learning rule by taking $\mu$ constant and averaging $G$ over all $x$.

Stochastic learning provides several advantages over batch learning. For large datasets the batch aver...
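As a rough illustration of update rule (1) and its batch counterpart, the following minimal sketch applies both to a small linear least-squares problem. It is not the adaptive momentum algorithm the abstract describes: the per-sample cost 0.5 * (y - w.x)^2, the constant learning rates, the synthetic noise-free data, and all function names (`neg_grad`, `stochastic_step`, `batch_step`, `make_data`) are assumptions chosen only for illustration.

```python
import random

def neg_grad(w, x, y):
    """G(w; x): minus the gradient of the assumed cost 0.5 * (y - w.x)^2."""
    err = y - sum(wi * xi for wi, xi in zip(w, x))
    return [err * xi for xi in x]

def stochastic_step(w, mu_t, x, y):
    """One on-line update: w_{t+1} = w_t + mu_t * G(w_t; x_t)."""
    g = neg_grad(w, x, y)
    return [wi + mu_t * gi for wi, gi in zip(w, g)]

def batch_step(w, mu, data):
    """One batch update: constant mu, with G averaged over all samples."""
    n = len(data)
    avg = [0.0] * len(w)
    for x, y in data:
        g = neg_grad(w, x, y)
        avg = [a + gi / n for a, gi in zip(avg, g)]
    return [wi + mu * ai for wi, ai in zip(w, avg)]

def make_data(n_samples, w_true, rng):
    """Hypothetical noise-free linear data: y = w_true . x, x ~ N(0, I)."""
    data = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        y = sum(wi * xi for wi, xi in zip(w_true, x))
        data.append((x, y))
    return data

rng = random.Random(0)
w_true = [2.0, -1.0]
data = make_data(200, w_true, rng)

# Stochastic (on-line) learning: one randomly drawn sample per update.
w_sgd = [0.0, 0.0]
for t in range(2000):
    x, y = data[rng.randrange(len(data))]
    w_sgd = stochastic_step(w_sgd, 0.05, x, y)

# Batch learning: every update averages G over the whole dataset.
w_batch = [0.0, 0.0]
for _ in range(200):
    w_batch = batch_step(w_batch, 0.5, data)
```

Each stochastic update touches a single sample, whereas each batch update costs a full pass over the data. That difference is why the per-step cost of the two forms diverges as the dataset grows.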
