The Microsoft 2017 Conversational Speech Recognition System

TLDR

The 2017 Microsoft conversational speech recognition system updates the 2016 version with recent neural-network acoustic and language models to advance the state of the art on the Switchboard task. It adds a CNN-BLSTM acoustic model and character-based and dialog-session-aware LSTM language models for rescoring. System combination runs in two stages: subsets of acoustic models are first combined at the senone/frame level, then the resulting systems are combined by word-level voting via confusion networks, followed by a confusion-network rescoring step. The system achieves a 5.1% word error rate on the 2000 Switchboard evaluation set.
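For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that definition; the function name `word_error_rate` is illustrative, and the whitespace tokenization is an assumption. Official scoring typically applies text normalization that this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a 5.1% word error rate corresponds to roughly one word error per twenty reference words.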

Abstract

We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog-session-aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by word-level voting via confusion networks. We also add a confusion network rescoring step after system combination. The resulting system yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
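The two-stage combination can be illustrated with a toy sketch. This is not the system's implementation. It assumes that stage one is a weighted average of per-frame senone posteriors, and that for stage two the confusion networks of all systems have already been aligned into the same sequence of slots (the alignment itself, usually the hard part, is not shown). The names `average_senone_posteriors`, `vote_over_slots`, and `EPS`, and all weights, are hypothetical.

```python
from collections import defaultdict

EPS = ""  # hypothetical marker for the empty word (a deletion) in a slot


def average_senone_posteriors(model_posteriors, weights):
    """Stage 1 (assumed form): weighted average of senone posteriors.

    model_posteriors[m][t][s] is model m's posterior for senone s at frame t.
    """
    n_frames = len(model_posteriors[0])
    n_senones = len(model_posteriors[0][0])
    total = sum(weights)
    fused = []
    for t in range(n_frames):
        frame = [
            sum(w * m[t][s] for m, w in zip(model_posteriors, weights)) / total
            for s in range(n_senones)
        ]
        fused.append(frame)
    return fused


def vote_over_slots(system_slots, weights):
    """Stage 2 (assumed form): weighted voting over pre-aligned slots.

    system_slots[k][i] maps each candidate word in slot i of system k
    to its posterior probability.
    """
    n_slots = len(system_slots[0])
    words = []
    for i in range(n_slots):
        scores = defaultdict(float)
        for slots, w in zip(system_slots, weights):
            for word, posterior in slots[i].items():
                scores[word] += w * posterior
        best = max(scores, key=scores.get)
        if best != EPS:  # the empty word wins: emit nothing for this slot
            words.append(best)
    return words
```

In a pipeline of this shape, the fused posteriors from stage one would feed a decoder that produces one confusion network per acoustic-model subset, and stage two would vote across those networks. The confusion network rescoring step mentioned in the abstract is not modeled here.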
