Concepedia

TLDR

The authors present a neural TTS system that can synthesize speech in the voice of many speakers, including unseen ones. The system comprises a speaker encoder trained on a large speaker‑verification dataset, a Tacotron‑2 style sequence‑to‑sequence network conditioned on the encoder’s embedding, and a WaveNet vocoder that converts the generated mel spectrogram into waveform. The model successfully transfers speaker variability learned by the encoder to synthesize natural speech from unseen speakers, and shows that training on a large, diverse speaker set and using random embeddings yields high‑quality novel‑speaker synthesis.

Abstract

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

References

YearCitations

Page 1