Publication | Closed Access
On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition
Citations: 122
References: 30
Year: 2020
Unknown Venue
E2E Models, Engineering, Machine Learning, Popular End-to-end Models, Spoken Language Processing, Speech Recognition, Natural Language Processing, Data Science, Robust Speech Recognition, Voice Recognition, Real-time Language, Machine Translation, Health Sciences, Computer Science, Deep Learning, Distant Speech Recognition, Speech Communication, Multi-speaker Speech Recognition, Transformer-AED Models, Speech Processing, Speech Input, Speech Perception, Linguistics, E2E Methods
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. Because E2E models are more data hungry, it is better to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in the streaming mode if its encoder can be properly initialized. Among all three E2E models, Transformer-AED achieved the best accuracy in both streaming and non-streaming modes. We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly optimized hybrid model.