Publication | Closed Access
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
Citations: 249
References: 30
Year: 2020
Unknown Venue
Convolutional Neural Network, Engineering, Machine Learning, External Language Model, Speech Recognition, Natural Language Processing, Robust Speech Recognition, Language Models, Real-time Language, Machine Translation, Health Sciences, Large Ai Model, Global Context, Computer Science, Deep Learning, Distant Speech Recognition, Speech Communication, Automatic Speech Recognition, Contextnet.contextnet Features, Multi-speaker Speech Recognition, Convolutional Neural Networks, Speech Processing, Speech Input, Speech Perception
Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind RNN/transformer-based models in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet and achieves a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the best previously published model of 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
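As a rough illustration of how a squeeze-and-excitation module can inject global context into a convolutional feature map, the sketch below averages a (time, channels) feature sequence over the whole utterance, passes the resulting vector through a small bottleneck, and uses the output as a per-channel gate. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function name `squeeze_excite_1d`, the ReLU bottleneck activation, the reduction factor of 8, and all shapes and weight values are hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite_1d(x, w1, b1, w2, b2):
    """Gate each channel of x (time, channels) using utterance-level context.

    Assumption: ReLU bottleneck followed by a sigmoid gate, as in the
    generic squeeze-and-excitation formulation.
    """
    context = x.mean(axis=0)                     # squeeze: average over time -> (channels,)
    hidden = np.maximum(0.0, context @ w1 + b1)  # bottleneck projection with ReLU
    gate = sigmoid(hidden @ w2 + b2)             # excitation: per-channel weights in (0, 1)
    return x * gate                              # same gate broadcast to every time step

# Hypothetical sizes: 50 frames, 16 channels, bottleneck reduction factor 8.
rng = np.random.default_rng(0)
T, C, R = 50, 16, 8
x = rng.standard_normal((T, C))
w1 = rng.standard_normal((C, C // R)) * 0.1
b1 = np.zeros(C // R)
w2 = rng.standard_normal((C // R, C)) * 0.1
b2 = np.zeros(C)

y = squeeze_excite_1d(x, w1, b1, w2, b2)
print(y.shape)  # (50, 16)
```

Because the gate is computed from an average over all frames, every time step is rescaled using information from the entire utterance, which a convolution with a limited kernel size cannot see on its own.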