Towards End-To-End Speech Recognition with Recurrent Neural Networks

TLDR

The authors propose an end-to-end speech recognizer that transcribes audio directly into text without an intermediate phonetic representation. They train a deep bidirectional LSTM network with a modified Connectionist Temporal Classification objective that minimizes the expected value of an arbitrary transcription loss. On the Wall Street Journal corpus, the model attains a word error rate (WER) of 27.3% with no prior linguistic information, 21.9% with a lexicon of allowed words, 8.2% with a trigram language model, and 6.7% when combined with a baseline system.

Abstract

This paper presents a speech recognition system that directly transcribes audio data into text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
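
The word error rate quoted above is conventionally computed as the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' scoring code; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenisation is a plain
    whitespace split, which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, this measure can exceed 1.0 when the hypothesis is much longer than the reference.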
