Deep Speech: Scaling up end-to-end speech recognition

TLDR

Deep Speech is a state-of-the-art speech recognition system built with end-to-end deep learning. It is significantly simpler than traditional speech systems, which rely on engineered processing pipelines and tend to perform poorly in noisy environments. The system needs no phoneme dictionary and learns a robust function directly from data, using a well-optimized multi-GPU RNN training system and novel data synthesis techniques that provide large, varied training sets. Deep Speech achieves 16.0% error on the full Switchboard Hub5'00 test set, outperforming previously published results, and handles noisy environments better than widely used, state-of-the-art commercial speech systems.

Abstract

We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
