Hybrid speech recognition with Deep Bidirectional LSTM

TLDR

Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have achieved state-of-the-art performance on the TIMIT speech database, yet the RNN-specific objective functions used to obtain those results are hard to integrate with large-vocabulary speech recognition systems. This study evaluates DBLSTM as an acoustic model within a conventional neural-network HMM hybrid system. The DBLSTM-HMM hybrid matches the earlier TIMIT results and surpasses GMM and deep-network baselines on a subset of the Wall Street Journal corpus. Its word-error-rate gain over the deep network is only modest, however, despite markedly higher frame-level accuracy, which suggests the approach is best suited to tasks dominated by acoustic modelling.

Abstract

Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However, the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.
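
As a rough illustration of what "acoustic model in a hybrid system" usually means, the sketch below shows the conventional hybrid step of turning per-frame state posteriors from a network into scaled log-likelihoods that an HMM decoder can consume, by dividing each posterior by the state prior. This is the generic hybrid recipe, not necessarily the exact procedure used in this paper; the function name, the flooring constant, and the toy numbers are assumptions made for the example.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert frame-level state posteriors p(s | x_t) into scaled
    log-likelihoods log p(s | x_t) - log p(s), which equal
    log p(x_t | s) up to a per-frame constant.

    posteriors: list of frames, each a list of per-state probabilities
                (hypothetical network outputs).
    priors:     list of per-state prior probabilities, e.g. estimated
                from state frequencies in a forced alignment.
    floor:      assumed small constant that avoids log(0).
    """
    log_priors = [math.log(max(p, floor)) for p in priors]
    return [
        [math.log(max(p, floor)) - lp for p, lp in zip(frame, log_priors)]
        for frame in posteriors
    ]

# Toy example: two frames, three HMM states (values are made up).
toy_posteriors = [[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]]
toy_priors = [0.5, 0.3, 0.2]
scores = scaled_log_likelihoods(toy_posteriors, toy_priors)
```

The decoder then combines these scores with HMM transition and language model scores. That combination is one reason a large gain in frame-level accuracy need not produce an equally large gain in word error rate.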
