Publication | Open Access
End-to-end attention-based large vocabulary speech recognition
Citations: 124
References: 24
Year: 2016
Unknown Venue
Engineering, Machine Learning, Spoken Language Processing, Recurrent Neural Network, Speech Recognition, Natural Language Processing, Data Science, Computational Linguistics, Language Studies, Real-time Language, Machine Translation, Sequence Modelling, N-gram Language Model, Deep Learning, Speech Communication, Speech Processing, Speech Input, Speech Perception, Hidden Markov Models, Linguistics
Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures were trained to model sequences of characters [1,2]. To our knowledge, all these approaches relied on Connectionist Temporal Classification [3] modules. We investigate an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels. We show how this setup can be applied to LVCSR by integrating the decoding RNN with an n-gram language model, and how its operation can be sped up by constraining the selections made by the attention mechanism and by pooling information over time to shorten the source sequences. Recognition accuracies similar to other HMM-free RNN-based approaches are reported for the Wall Street Journal corpus.
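The three ideas named in the abstract (pooling over time, constraining the attention mechanism's selections, and combining the decoder with a language model) can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so everything below is an assumption made for illustration: the function names, the averaging form of pooling, the fixed window around a given center frame, and the weighted log-linear combination with `lm_weight` are all hypothetical.

```python
import math


def pool_over_time(frames, stride=2):
    """Average each group of `stride` consecutive frames (assumed pooling form).

    Shortens the source sequence by roughly a factor of `stride`.
    """
    pooled = []
    for start in range(0, len(frames), stride):
        group = frames[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def windowed_attention(scores, center, width):
    """Softmax over `scores`, restricted to frames within `width` of `center`.

    Frames outside the window get weight 0, so the attention mechanism
    only has to consider a bounded number of positions per output step.
    `center` is assumed to be a valid index into `scores`.
    """
    lo = max(0, center - width)
    hi = min(len(scores), center + width + 1)
    peak = max(scores[lo:hi])  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores[lo:hi]]
    total = sum(exps)
    weights = [0.0] * len(scores)
    for i, e in zip(range(lo, hi), exps):
        weights[i] = e / total
    return weights


def combined_score(log_p_decoder, log_p_lm, lm_weight=0.5):
    """Hypothetical log-linear mix of decoder and n-gram LM log-probabilities."""
    return log_p_decoder + lm_weight * log_p_lm


# Usage: three 2-dimensional frames pooled in pairs, then attention
# over five scores limited to one frame on either side of index 2.
pooled = pool_over_time([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], stride=2)
weights = windowed_attention([0.0, 1.0, 2.0, 3.0, 4.0], center=2, width=1)
```

In this sketch the window keeps the cost of each attention step independent of the utterance length, while pooling reduces the number of frames the decoder attends over in the first place.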