Publication | Open Access
Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
Citations: 82 · References: 27 · Year: 2023
Keywords: Engineering, Spoken Language Processing, Language Processing, Speech Recognition, Natural Language Processing, Computational Linguistics, Speech Interface, Health Sciences, Speech Models, Speech Synthesis, Speech Output, Computer Science, Text-to-speech, Character Error Rate, Speech Communication, Voice, Speech Summarization, Multi-speaker Speech Recognition, Speech Acoustics, Discrete Speech Representations, Speech Processing, Speech Input, Speech Perception, Linguistics, Minimal Supervision
We present SPEAR‑TTS, a multi‑speaker text‑to‑speech system that can be trained with minimal supervision. SPEAR‑TTS decomposes TTS into two sequence‑to‑sequence stages—text to high‑level semantic tokens and semantic tokens to low‑level acoustic tokens—allowing the speaking module to be trained on abundant audio‑only data and enabling speaker identity control via 3‑second example prompting without explicit speaker labels. Experiments show SPEAR‑TTS attains a character error rate competitive with state‑of‑the‑art methods using only 15 minutes of parallel data, while matching ground‑truth speech in naturalness and acoustic quality.
Abstract

We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in naturalness and acoustic quality.
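The two-stage decomposition described above can be sketched schematically. This is a minimal toy illustration, not the paper's models: the real "reading" and "speaking" stages are trained sequence-to-sequence Transformers over discrete token vocabularies, and the acoustic tokens would be decoded to a waveform by a neural codec. All function names, token schemes, and the "voice offset" trick below are invented stand-ins chosen only to show the composition and the prompt-based speaker conditioning.

```python
# Toy sketch of SPEAR-TTS's two-stage pipeline (hypothetical interfaces,
# deterministic stand-ins for trained seq2seq models).

def read(text):
    """Stage 1 ("reading"): text -> high-level semantic tokens.
    Toy stand-in: one semantic token id per character."""
    return [ord(c) % 100 for c in text.lower()]

def speak(semantic_tokens, prompt_acoustic=()):
    """Stage 2 ("speaking"): semantic tokens -> low-level acoustic tokens.
    Speaker identity is controlled by prefixing acoustic tokens from a
    short (~3 s) prompt; the model continues in the prompt's voice.
    Toy stand-in: shift outputs by a "voice" value derived from the prompt."""
    voice = sum(prompt_acoustic) % 7  # crude proxy for speaker identity
    return [t * 2 + voice for t in semantic_tokens]

def tts(text, prompt_acoustic=()):
    """Compose the stages: text -> semantic tokens -> acoustic tokens.
    A neural codec decoder would map acoustic tokens to audio; omitted."""
    return speak(read(text), prompt_acoustic)
```

Because the stages only communicate through semantic tokens, the "speaking" stage can be trained on audio-only data (semantic and acoustic tokens are both derived from raw audio), while the same text maps to different acoustic outputs depending on the prompt, e.g. `tts("hi")` versus `tts("hi", prompt_acoustic=(1, 2))`.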