Attention-Based End-to-End Speech Recognition on Voice Search

TLDR

End-to-end speech recognition transcribes speech directly to text without predefined alignments, but it has been difficult to apply to Mandarin because of the logographic orthography, the large vocabulary, and the conditional dependency of the attention model. This study investigates an attention-based encoder-decoder model for Mandarin voice search, using character embeddings to handle the large vocabulary. Training uses L2 regularization, Gaussian weight noise, and frame skipping; two attention mechanisms are compared, and attention smoothing is used to cover long context. Together, these techniques yield a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset, improving to 2.81% and 5.77%, respectively, when combined with a trigram language model.

Abstract

Recently, there has been growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of an attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying an attention-based encoder-decoder to Mandarin speech recognition is quite difficult because of the logographic orthography of Mandarin, the large vocabulary, and the conditional dependency of the attention model. We use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise, and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. Combined with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.
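
The two metrics reported above can be illustrated with a minimal sketch. It assumes the conventional definitions, which the abstract does not spell out: CER is the total character-level edit distance divided by the total number of reference characters, and SER is the fraction of utterances whose hypothesis differs from the reference in any way. The function names and the toy utterances are hypothetical and are not taken from the paper or the MiTV dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer_and_ser(refs, hyps):
    """Return (CER, SER) for parallel lists of reference and hypothesis strings."""
    total_chars = sum(len(r) for r in refs)
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    wrong_sentences = sum(r != h for r, h in zip(refs, hyps))
    return total_errors / total_chars, wrong_sentences / len(refs)


# Toy example (hypothetical data): one substituted character in the second utterance.
refs = ["打开电视", "播放音乐"]
hyps = ["打开电视", "播放音月"]
cer, ser = cer_and_ser(refs, hyps)
print(f"CER = {cer:.2%}, SER = {ser:.2%}")
```

Because a single wrong character makes the whole utterance count as an error, SER is at least as large as CER whenever no utterance contains more errors than it has reference characters, which is consistent with the reported 7.43% SER against 3.58% CER.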
