Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence\n Learning

Abstract

We present Deep Voice 3, a fully-convolutional attention-based neural\ntext-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural\nspeech synthesis systems in naturalness while training ten times faster. We\nscale Deep Voice 3 to data set sizes unprecedented for TTS, training on more\nthan eight hundred hours of audio from over two thousand speakers. In addition,\nwe identify common error modes of attention-based speech synthesis networks,\ndemonstrate how to mitigate them, and compare several different waveform\nsynthesis methods. We also describe how to scale inference to ten million\nqueries per day on one single-GPU server.\n