Code-Switched Speech Synthesis Using Bilingual Phonetic Posteriorgram with Only Monolingual Corpora

Abstract

Synthesizing fluent code-switched (CS) speech with a consistent voice using only monolingual corpora remains challenging, since language alternation seldom occurs in the training data and speaker identity is directly correlated with language. In this paper, we present a bilingual phonetic posteriorgram (PPG) based CS speech synthesizer using only monolingual corpora. The bilingual PPG, formed by stacking two monolingual PPGs extracted from two monolingual speaker-independent speech recognition systems, is used to bridge across speakers and languages. It is assumed that the bilingual PPG represents the articulation of speech sounds speaker-independently and captures accurate phonetic information of both languages in the same feature space. The proposed model first extracts bilingual PPGs from the training data. An encoder-decoder based model then learns the relationship between input text and bilingual PPGs, and the bilingual PPGs are mapped to acoustic features by a bidirectional long short-term memory based model conditioned on a speaker embedding to control speaker identity. Experiments validate the effectiveness of the proposed model in terms of speech intelligibility, audio fidelity, and speaker consistency of the generated code-switched speech.
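The stacking step described above can be illustrated with a minimal sketch. It assumes that "stacking" means frame-wise concatenation of the two posterior vectors along the feature axis and that both PPGs are already frame-aligned; the abstract does not confirm either detail. The function name `stack_ppgs` and the toy class counts are hypothetical, chosen only for illustration.

```python
from typing import List

# One frame = posterior probabilities over one language's phonetic classes.
Frame = List[float]


def stack_ppgs(ppg_a: List[Frame], ppg_b: List[Frame]) -> List[Frame]:
    """Concatenate two frame-aligned monolingual PPGs along the feature axis.

    Assumption: both PPGs were extracted from the same utterance at the same
    frame rate, so frame i of ppg_a corresponds to frame i of ppg_b.
    """
    if len(ppg_a) != len(ppg_b):
        raise ValueError("both PPGs must cover the same number of frames")
    return [frame_a + frame_b for frame_a, frame_b in zip(ppg_a, ppg_b)]


# Toy data: 2 frames, 3 phonetic classes for one language and 2 for the other.
ppg_lang_a = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
ppg_lang_b = [[0.5, 0.5], [0.9, 0.1]]

bilingual_ppg = stack_ppgs(ppg_lang_a, ppg_lang_b)
```

Under this assumption each bilingual frame has as many dimensions as the two recognizers have phonetic classes combined, so either language can be described in the same feature space regardless of which speaker produced the audio.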
