Publication | Closed Access
Attention-Based End-to-End Speech Recognition on Voice Search
Citations: 73
References: 20
Year: 2018
Venue: Unknown
Engineering, Machine Learning, Spoken Language Processing, Multilingual Pretraining, Large Language Model, Corpus Linguistics, Speech Recognition, Natural Language Processing, Computational Linguistics, Voice Recognition, Real-time Language, Machine Translation, Health Sciences, Voice Search, Mandarin Speech Recognition, Attention Model, Deep Learning, Character Error Rate, Speech Communication, Voice, Speech Processing, Speech Input, Speech Perception, Linguistics
End-to-end speech recognition aims to transcribe speech directly to text without predefined alignments, yet Mandarin's logographic orthography, its large vocabulary, and the conditional dependency of the attention model have made this difficult. The study investigates an attention-based encoder-decoder model for Mandarin voice-search recognition, using character embeddings to handle the large vocabulary. Training used L2 regularization, Gaussian weight noise, and frame skipping; the study also compared two attention mechanisms and applied attention smoothing to cover long context. Together, these techniques yield a character error rate of 3.58% and a sentence error rate of 7.43% on the MiTV voice search dataset, improving to 2.81% and 5.77%, respectively, when combined with a trigram language model.
Recently, there has been growing interest in end-to-end speech recognition, which directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of an attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder models to Mandarin speech recognition is quite difficult because of the logographic orthography of Mandarin, the large vocabulary, and the conditional dependency of the attention model. We use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise, and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. Combined with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.
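The two reported metrics can be illustrated with a small sketch. It assumes the usual definitions: CER is the total character-level edit distance divided by the total number of reference characters, and SER is the fraction of utterances with at least one error. The abstract does not spell out its scoring procedure, so these definitions, the function names, and the toy queries are assumptions for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: strings of characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_and_ser(refs, hyps):
    """Return (CER, SER) over paired reference and hypothesis transcripts."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(ref) for ref in refs)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = 0
    wrong_sentences = 0
    for ref, hyp in zip(refs, hyps):
        distance = edit_distance(ref, hyp)
        total_edits += distance
        wrong_sentences += distance > 0
    return total_edits / total_chars, wrong_sentences / len(refs)


# Hypothetical voice-search queries; the second hypothesis has one wrong character.
refs = ["播放音乐", "打开电视"]
hyps = ["播放音乐", "打开电影"]
cer, ser = cer_and_ser(refs, hyps)
print(f"CER = {cer:.3f}, SER = {ser:.3f}")  # CER = 0.125, SER = 0.500
```

Because Mandarin is scored per character rather than per word, a single substituted character in a short query costs little in CER but counts as a full sentence error, which is one reason the reported SER is higher than the CER.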