Publication | Closed Access
Towards End-To-End Speech Recognition with Recurrent Neural Networks
Citations: 1.9K
References: 20
Year: 2014
Venue: Unknown
The authors propose an end-to-end speech recognizer that transcribes audio directly into text without an intermediate phonetic representation. They use a deep bidirectional LSTM network trained with a modified Connectionist Temporal Classification (CTC) objective that minimizes the expected value of a transcription loss. On the Wall Street Journal corpus the model attains a word error rate (WER) of 27.3% with no prior linguistic information, 21.9% with a lexicon of allowed words, 8.2% with a trigram language model, and 6.7% when combined with a baseline system.
This paper presents a speech recognition system that directly transcribes audio data into text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
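To make the two quantities in the abstract concrete, the following is a minimal Python sketch of the word error rate and of the expectation of a transcription loss over a set of candidate transcriptions. It is an illustration only, not the paper's training procedure: the function names (`word_edit_distance`, `word_error_rate`, `expected_loss`), the candidate list, and its probabilities are all hypothetical, and the sketch assumes whitespace-separated words and the usual definition of word error rate as word-level edit distance divided by reference length.

```python
from typing import List, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences (hypothetical helper)."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


def expected_loss(reference: str,
                  candidates: List[Tuple[str, float]]) -> float:
    """Probability-weighted average WER over candidate transcriptions.

    `candidates` is an assumed list of (text, probability) pairs; the
    probabilities are renormalised so they need not sum to one.
    """
    total = sum(p for _, p in candidates)
    if total <= 0:
        raise ValueError("candidate probabilities must sum to a positive value")
    return sum(p * word_error_rate(reference, text)
               for text, p in candidates) / total


if __name__ == "__main__":
    reference = "the cat sat"
    # Made-up candidates and probabilities, for illustration only.
    candidates = [("the cat sat", 0.5), ("the bat sat", 0.5)]
    print(word_error_rate(reference, "the bat sat"))
    print(expected_loss(reference, candidates))
```

Minimising a quantity like `expected_loss` rewards the network for placing probability on transcriptions with few word errors, which is the sense in which the abstract speaks of optimising the word error rate directly.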