Natural Gradient Works Efficiently in Learning

TLDR

In a parameter space with underlying structure, the ordinary gradient of a function does not give its steepest direction; the natural gradient, computed via information geometry, does. Natural gradients are derived for the parameter spaces of perceptrons, of matrices (for blind source separation), and of linear dynamical systems (for blind source deconvolution). Natural gradient online learning is proved to be Fisher efficient, meaning it asymptotically matches the performance of optimal batch estimation, and it may alleviate the plateau phenomenon seen in backpropagation for multilayer perceptrons. An adaptive method for updating the learning rate is also proposed and analyzed.

Abstract

When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction, but the natural gradient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for blind source separation), and the space of linear dynamical systems (for blind source deconvolution). The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters. This suggests that the plateau phenomenon, which appears in the backpropagation learning algorithm of multilayer perceptrons, might disappear or might not be so serious when the natural gradient is used. An adaptive method of updating the learning rate is proposed and analyzed.
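As a rough illustration of the idea, the sketch below runs natural gradient online learning on a toy problem that is not taken from the paper: estimating the mean and standard deviation of a one-dimensional Gaussian. The model, the function names (nll_grad, fisher, natural_gradient_online), the starting point, and the 1/t learning rate are all assumptions chosen for simplicity; in particular, the 1/t schedule stands in for, and is not, the adaptive learning rate rule the abstract mentions. The update is w <- w - (1/t) G(w)^{-1} grad l(x_t; w), where G is the Fisher information matrix.

```python
import numpy as np

def nll_grad(x, mu, sigma):
    """Ordinary gradient of l = log(sigma) + (x - mu)^2 / (2 sigma^2)."""
    d = x - mu
    return np.array([-d / sigma**2, 1.0 / sigma - d**2 / sigma**3])

def fisher(mu, sigma):
    """Fisher information of N(mu, sigma^2) in (mu, sigma) coordinates."""
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])

def natural_gradient_online(xs, mu0=0.0, sigma0=1.0):
    """One pass over xs, one sample per step, with learning rate 1/t (assumed)."""
    w = np.array([mu0, sigma0], dtype=float)
    for t, x in enumerate(xs, start=1):
        g = nll_grad(x, w[0], w[1])
        # Natural gradient: precondition the ordinary gradient by G^{-1}.
        nat_g = np.linalg.solve(fisher(w[0], w[1]), g)
        w = w - nat_g / t
    return w

# Hypothetical data: 20000 samples from N(3, 2^2).
rng = np.random.default_rng(0)
xs = rng.normal(loc=3.0, scale=2.0, size=20000)
mu_hat, sigma_hat = natural_gradient_online(xs)
print(mu_hat, sigma_hat)
```

In this toy case the mean component of the natural gradient reduces to -(x - mu), so with the 1/t rate the online estimate of the mean is the running sample mean, which is the batch maximum likelihood estimate. That is a small, concrete instance of an online rule matching batch estimation, though it is far simpler than the perceptron and blind separation settings the abstract covers.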
