On the importance of initialization and momentum in deep learning

TLDR

Deep and recurrent neural networks (DNNs and RNNs) were once thought almost impossible to train using stochastic gradient descent with momentum. This study shows that, given a well-designed random initialization and a slowly increasing schedule for the momentum parameter, stochastic gradient descent with momentum can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance previously achievable only with Hessian-Free optimization. Both ingredients are essential: poorly initialized networks cannot be trained with momentum, and well-initialized networks perform markedly worse when momentum is absent or poorly tuned. Given a good initialization, carefully tuned momentum suffices to handle the curvature issues in these training objectives without sophisticated second-order methods.

Abstract

Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.
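
The abstract does not spell out the update rule or the exact momentum schedule, so the following is only a minimal sketch under stated assumptions: classical (heavy-ball) momentum, and a hypothetical schedule that raises the momentum coefficient in stages (0.5, 0.75, 0.833, ...) until it reaches a cap. The names `momentum_schedule`, `sgd_momentum`, `quadratic_grad`, `period`, and `mu_max`, and all default values, are illustrative choices rather than settings taken from the text.

```python
def momentum_schedule(step, mu_max=0.99, period=250):
    """Assumed schedule: momentum rises every `period` steps, capped at mu_max.

    Stage k (starting at 1) gives 1 - 1/(2k): 0.5, 0.75, 0.833, ...
    """
    stage = step // period + 1
    return min(1.0 - 1.0 / (2.0 * stage), mu_max)


def sgd_momentum(grad_fn, theta, lr=0.01, steps=2000, mu_max=0.99):
    """Classical momentum: v <- mu*v - lr*grad(theta); theta <- theta + v."""
    velocity = [0.0] * len(theta)
    for t in range(steps):
        mu = momentum_schedule(t, mu_max)
        grad = grad_fn(theta)
        velocity = [mu * v - lr * g for v, g in zip(velocity, grad)]
        theta = [p + v for p, v in zip(theta, velocity)]
    return theta


def quadratic_grad(theta, curvatures=(1.0, 100.0)):
    """Gradient of the toy objective 0.5 * sum(c_i * x_i**2).

    The two curvatures differ by a factor of 100, a stand-in for the
    ill-conditioned directions the abstract calls curvature issues.
    """
    return [c * x for c, x in zip(curvatures, theta)]


final_theta = sgd_momentum(quadratic_grad, [1.0, 1.0])
```

On this toy problem the low-curvature coordinate is the slow one under plain gradient descent; the growing momentum coefficient is what speeds it up, while the learning rate stays small enough to keep the high-curvature coordinate stable. A real network would replace `quadratic_grad` with minibatch gradients.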
