CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

Abstract

Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained PT suffer from source speaker leakage, where the synthesised speech has the voice identity of the source speaker as opposed to the target speaker. In order to mitigate this issue, they compromise on the quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT system that is robust to source speaker leakage, without using parallel data. We achieve this through a novel reference encoder architecture capable of capturing temporal prosodic representations which are robust to source speaker leakage. We compare CopyCat against a state-of-the-art fine-grained PT model through various subjective evaluations, where we show a relative improvement of 47% in the quality of prosody transfer and 14% in preserving the target speaker identity, while still maintaining the same naturalness.
