Publication | Closed Access
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
2.6K
Citations
27
References
2018
Year
ICASSP
Venue
Engineering, Machine Learning, Spoken Language Processing, Recurrent Neural Network, Speech Recognition, Natural Language Processing, Tacotron 2, Machine Translation, Health Sciences, Systems Biology, Speech Synthesis, Speech Output, Text-to-speech, Signal Processing, Speech Communication, Speech Processing, Speech Input, Natural TTS Synthesis, Speech Perception, Linguistics, Neural Network Architecture
The paper introduces Tacotron 2, a neural network architecture that synthesizes speech directly from text. Tacotron 2 consists of a recurrent sequence‑to‑sequence network that converts character embeddings into mel‑scale spectrograms, which are then fed into a modified WaveNet vocoder; ablation studies confirm the effectiveness of using mel spectrograms as conditioning. The model attains a mean opinion score of 4.53, close to the 4.58 score of professional recordings, and the use of mel spectrograms enables a substantial reduction in the WaveNet architecture size.
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F0 features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
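The mel spectrogram that bridges the two networks is the key intermediate representation here: an 80-channel, log-compressed mel-scale spectrogram spanning roughly 125 Hz to 7.6 kHz. The sketch below shows how such a representation can be computed from a waveform with NumPy; the FFT size, hop length, and sample rate are illustrative placeholders, not the paper's exact analysis parameters (the paper uses 50 ms frames with a 12.5 ms hop).

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr, fmin=125.0, fmax=7600.0):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def mel_spectrogram(wav, sr=24000, n_fft=1024, hop=256, n_mels=80):
    # Windowed STFT magnitude, mel projection, then log dynamic range compression
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    mel = mag @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(np.clip(mel, 1e-5, None))  # floor avoids log(0)

# Example: one second of a 440 Hz tone -> one 80-channel mel frame per hop
wav = np.sin(2 * np.pi * 440 * np.arange(24000) / 24000)
S = mel_spectrogram(wav)
```

In the full system, frames like the rows of `S` are what the feature prediction network is trained to emit and what the WaveNet vocoder consumes as its conditioning input, replacing the linguistic, duration, and F0 features of the original WaveNet.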