Publication | Closed Access
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
Citations: 3.1K
References: 79
Year: 2011
Large-vocabulary Speech Recognition · Machine Learning · Engineering · Deep Belief Networks · Phone Recognition · Spoken Language Processing · Speech Recognition · Natural Language Processing · Data Science · Robust Speech Recognition · Voice Recognition · Health Sciences · Novel Context-dependent · Computer Science · Deep Learning · Distant Speech Recognition · Speech Communication · Multi-speaker Speech Recognition · Speech Processing · Speech Input · Speech Perception · Linguistics
Deep belief network pre-training robustly initializes deep neural networks, aiding optimization and reducing generalization error. Building on this, the authors propose a context-dependent DNN-HMM hybrid for large-vocabulary speech recognition in which the pre-trained DNN outputs a distribution over senones, and they detail the model's components, the procedure for applying it, and the effects of key modeling choices. On a challenging business search dataset, the CD-DNN-HMM achieves absolute sentence-accuracy gains of 5.8% and 9.2% (relative error reductions of 16.0% and 23.2%) over conventional CD-GMM-HMMs trained with the MPE and ML criteria, respectively.
We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.
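The hybrid recipe described in the abstract lends itself to a compact illustration. The following is a minimal sketch, not the authors' implementation: all sizes are toy assumptions (39-dimensional frames, two 128-unit hidden layers, 50 senones, whereas real systems use far wider layers and thousands of senones). It shows greedy CD-1 RBM pre-training to initialize the hidden layers, a softmax output over senones with one cross-entropy fine-tuning step, and the usual conversion of senone posteriors to scaled likelihoods for HMM decoding.

```python
# Minimal sketch of a CD-DNN-HMM-style recipe (illustrative only, not the paper's code):
# greedy RBM pre-training, a softmax over senones, and posterior-to-likelihood scaling.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_rbm(data, n_hidden, epochs=5, lr=0.01):
    """One-step contrastive divergence (CD-1) for an RBM with logistic units."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W + b_h)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h0_sample @ W.T + b_v)            # one-step reconstruction
        h1 = sigmoid(v1 @ W + b_h)
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
        b_v += lr * (v0 - v1).mean(axis=0)
        b_h += lr * (h0 - h1).mean(axis=0)
    return W, b_h

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy acoustic frames in [0, 1] (treated as visible-unit probabilities) and senone labels.
n_frames, n_features, n_senones = 256, 39, 50          # hypothetical sizes
hidden_sizes = [128, 128]                               # illustrative; the paper uses much wider layers
X = rng.random((n_frames, n_features))
y = rng.integers(0, n_senones, n_frames)

# 1) Greedy layer-wise DBN pre-training: each RBM's weights initialize one DNN layer.
weights, biases, layer_input = [], [], X
for h in hidden_sizes:
    W, b = pretrain_rbm(layer_input, h)
    weights.append(W); biases.append(b)
    layer_input = sigmoid(layer_input @ W + b)

# 2) Add a randomly initialized softmax layer over senones and fine-tune with backprop
#    (a single cross-entropy gradient step shown here for brevity).
W_out = 0.01 * rng.standard_normal((hidden_sizes[-1], n_senones))
b_out = np.zeros(n_senones)
posteriors = softmax(layer_input @ W_out + b_out)       # p(senone | acoustic frame)
one_hot = np.eye(n_senones)[y]
W_out -= 0.1 * (layer_input.T @ (posteriors - one_hot)) / n_frames

# 3) Hybrid decoding uses scaled likelihoods p(x|s) ∝ p(s|x) / p(s),
#    with senone priors estimated from the training alignment.
priors = np.bincount(y, minlength=n_senones) / n_frames
scaled_likelihoods = posteriors / np.maximum(priors, 1e-8)
print(scaled_likelihoods.shape)                         # (frames, senones), fed to the HMM decoder
```

In the full system described by the abstract, the senone targets come from a forced alignment produced by an existing CD-GMM-HMM, fine-tuning runs over the whole training set, and the DNN's scaled likelihoods replace the GMM state likelihoods inside the decoder; the sketch above only indicates where those pieces fit.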