Asynchronous stochastic gradient descent for DNN training

Abstract

It is well known that state-of-the-art speech recognition systems using deep neural network (DNN) can greatly improve the system performance compared with conventional GMM-HMM. However, what we have to pay correspondingly is the immense training cost due to the enormous parameters of DNN. Unfortunately, it is difficult to achieve parallelization of the minibatch-based back-propagation (BP) algorithm used in DNN training because of the frequent model updates. In this paper we describe an effective approach to achieve an approximation of BP - asynchronous stochastic gradient descent (ASGD), which is used to parallelize computing on multi-GPU. This approach manages multiple GPUs to work asynchronously to calculate gradients and update the global model parameters. Experimental results show that it achieves a 3.2 times speed-up on 4 GPUs than the single one, without any recognition performance loss.

References

Page 1

	Year	Citations

Page 1