Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks

TLDR

Grapheme-to-phoneme models are essential for speech recognition and text-to-speech, yet traditional joint-sequence approaches require explicit grapheme-to-phoneme alignments, which are hard to obtain because graphemes and phonemes do not align one-to-one. This work proposes a grapheme-to-phoneme model based on a Long Short-Term Memory recurrent neural network. The LSTM model uses the full grapheme context to convert entire words into pronunciations, and is implemented both as unidirectional LSTMs with varying output delays and as a deep bidirectional LSTM with a connectionist temporal classification layer. The approach eliminates the need for explicit alignments: the deep bidirectional model achieves a 25.8% word error rate on the CMU dataset, improving to 21.3% when combined with a joint n-gram model, a 9% relative improvement over the previous best of 23.4%.

Abstract

Grapheme-to-phoneme (G2P) models are key components in speech recognition and text-to-speech systems, as they describe how words are pronounced. We propose a G2P model based on a Long Short-Term Memory (LSTM) recurrent neural network (RNN). In contrast to traditional joint-sequence based G2P approaches, LSTMs have the flexibility to take the full context of graphemes into consideration, transforming the problem from a series of grapheme-to-phoneme conversions into a word-to-pronunciation conversion. Training joint-sequence based G2P models requires explicit grapheme-to-phoneme alignments, which are not straightforward to obtain since graphemes and phonemes do not correspond one-to-one. The LSTM based approach forgoes the need for such explicit alignments. We experiment with unidirectional LSTMs (ULSTM) with different kinds of output delays and with a deep bidirectional LSTM (DBLSTM) with a connectionist temporal classification (CTC) layer. The DBLSTM-CTC model achieves a word error rate (WER) of 25.8% on the public CMU dataset for US English. Combining the DBLSTM-CTC model with a joint n-gram model results in a WER of 21.3%, which is a 9% relative improvement compared to the previous best WER of 23.4% from a hybrid system.
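
The reported figures can be checked with a small sketch of the two metrics involved. It assumes that a word counts as an error when the predicted phoneme sequence matches none of its reference pronunciations, which is a common G2P scoring convention rather than a detail stated in the abstract. The function names and the toy lexicon are hypothetical and only serve as illustration.

```python
def word_error_rate(hypotheses, references):
    """Percentage of words whose predicted pronunciation matches no reference.

    hypotheses: dict mapping word -> predicted phoneme tuple
    references: dict mapping word -> set of acceptable phoneme tuples
    (Scoring convention assumed for illustration, not taken from the abstract.)
    """
    errors = sum(
        1 for word, phones in hypotheses.items()
        if tuple(phones) not in references[word]
    )
    return 100.0 * errors / len(hypotheses)


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction, in percent, of new_wer over baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Toy lexicon (hypothetical data): some words have several valid pronunciations.
references = {
    "cat": {("K", "AE", "T")},
    "read": {("R", "IY", "D"), ("R", "EH", "D")},
    "dog": {("D", "AO", "G")},
    "the": {("DH", "AH"), ("DH", "IY")},
}
hypotheses = {
    "cat": ("K", "AE", "T"),
    "read": ("R", "EH", "D"),   # matches the second reference
    "dog": ("D", "AA", "G"),    # matches no reference, counted as an error
    "the": ("DH", "AH"),
}

toy_wer = word_error_rate(hypotheses, references)

# WER values quoted in the abstract: 23.4% (previous best), 21.3% (combined model).
gain = relative_improvement(23.4, 21.3)

print(f"toy WER: {toy_wer:.1f}%")
print(f"relative improvement: {gain:.1f}%")
```

With the abstract's numbers, (23.4 - 21.3) / 23.4 is roughly 0.09, which agrees with the stated 9% relative improvement.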
