Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

TLDR

End-to-end learning replaces pipelines of hand-engineered components with neural networks, which lets a single approach handle noisy environments, accents, and different languages. The resulting system recognizes both English and Mandarin Chinese speech. HPC techniques yield a 7x speedup over the authors' previous system, cutting experiment time from weeks to days and allowing faster iteration on architectures and algorithms. In several cases the system is competitive with human transcription on standard datasets. Batch Dispatch on data-center GPUs allows inexpensive online deployment with low latency at scale.

Abstract

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
