Cross-Lingual Low Resource Speaker Adaptation Using Phonological Features

Abstract

The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has recently been proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages, with the goal of achieving cross-lingual speaker adaptation. We first experiment with the effect of language phonological similarity on cross-lingual TTS of several source-target language combinations. Subsequently, we fine-tune the model with very limited data of a new speaker's voice in either a seen or an unseen language, and achieve synthetic speech of equal quality, while preserving the target speaker's identity. With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature. In the extreme case of only 2 available adaptation utterances, we find that our model behaves as a few-shot learner, as the performance is similar in both the seen and unseen adaptation language scenarios.
