Publication | Closed Access
Convolutional Neural Networks for Speech Recognition
Citations: 2.3K | References: 38 | Year: 2014
Speech Perception, Engineering, Machine Learning, Voice, Complex Correlations, Health Sciences, Multi-speaker Speech Recognition, Convolutional Neural Networks, Robust Speech Recognition, Speech Processing, Error Rate, Computer Science, Speech Input, Voice Recognition, Deep Learning, Distant Speech Recognition, Speech Communication, Speech Recognition
Hybrid DNN-HMM models have markedly improved speech recognition over conventional GMM-HMM systems, partly because DNNs capture complex correlations in speech features. This study shows that convolutional neural networks (CNNs) can lower error rates further. The authors describe the basic CNN, propose a limited-weight-sharing scheme better suited to speech features, and explain how local connectivity, weight sharing, and pooling give some invariance to small shifts along the frequency axis, which helps with speaker and environment variation. In experiments, CNNs reduce the error rate by 6%-10% compared with DNNs on TIMIT phone recognition and a large-vocabulary voice search task.
Recently, the hybrid deep neural network (DNN)-hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. The performance improvement is partially attributed to the ability of the DNN to model complex correlations in speech features. In this paper, we show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features. The special structure such as local connectivity, weight sharing, and pooling in CNNs exhibits some degree of invariance to small shifts of speech features along the frequency axis, which is important to deal with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition and the voice search large vocabulary speech recognition tasks.
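The invariance claim in the abstract can be illustrated with a toy sketch. The example below is not the authors' model: it assumes a hypothetical 12-band spectral frame, a single hand-picked filter `w`, and non-overlapping max pooling, and it uses ordinary (full) weight sharing rather than the limited-weight-sharing scheme the paper proposes. All names (`conv_freq`, `max_pool`, `frame`, `shifted`) are invented for illustration.

```python
import numpy as np

def conv_freq(x, w):
    """Slide one filter along the frequency axis (valid cross-correlation).

    The same weights w are applied at every band position (weight sharing),
    and each output sees only len(w) neighbouring bands (local connectivity).
    """
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def max_pool(h, size):
    """Non-overlapping max pooling over groups of `size` adjacent outputs."""
    n = len(h) // size
    return h[:n * size].reshape(n, size).max(axis=1)

# Hypothetical frame: 12 frequency bands with a single spectral peak.
frame = np.zeros(12)
frame[4] = 1.0
shifted = np.roll(frame, 1)  # same peak, moved up by one band

w = np.array([0.5, 1.0, 0.5])  # assumed filter, chosen by hand

h_a, h_b = conv_freq(frame, w), conv_freq(shifted, w)
p_a, p_b = max_pool(h_a, 3), max_pool(h_b, 3)

# The convolution output moves with the input peak...
print(np.argmax(h_a), np.argmax(h_b))
# ...but the strongest pooled unit stays in the same pooling group.
print(np.argmax(p_a), np.argmax(p_b))
```

In this toy case the invariance is only partial: the strongest pooled response keeps its position and value, while neighbouring pooled units may still change, and a larger shift can cross a pooling boundary. That is consistent with the abstract's wording of "some degree of invariance to small shifts".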