Publication | Closed Access
Hierarchical Prosody Modeling and Control in Non-Autoregressive Parallel Neural TTS
Citations: 16
References: 26
Year: 2022
Engineering, Neurolinguistics, Spoken Language Processing, Phonology, Acoustic Modeling, Speech Recognition, Natural Language Processing, Phonetics, Speech Motor Control, Language Studies, Output Prosody, Machine Translation, Speech Perception, Latent Prosody Space, Natural Speech, Speech Synthesis, Speech Output, Deep Learning, Text-to-speech, Speech Communication, Speech Technology, Speech Processing, Hierarchical Prosody Modeling, Linguistics, Language Generation
Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the synthetic speech often represents the average prosodic style of the database instead of having more versatile prosodic variation. Moreover, many models lack the ability to control the output prosody, which does not allow for different styles for the same text input. In this work, we train a non-autoregressive parallel neural TTS front-end model hierarchically conditioned on both coarse and fine-grained acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension, generate a wide variety of speaking styles, and provide word-wise emphasis control, while maintaining quality equal to or better than that of the baseline model.
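As a rough illustration of what "utterance-wise" conditioning features could look like, the sketch below reduces frame-level pitch and energy tracks and per-phone durations to one value per utterance. It is not the paper's feature extraction: the function name `utterance_prosody`, the use of log-F0, the 5th to 95th percentile definition of pitch range, and the convention that unvoiced frames carry an F0 of 0.0 are all assumptions made for this example, and spectral tilt is omitted because it requires a spectral analysis step not described in the abstract.

```python
import math
from statistics import mean


def utterance_prosody(f0_hz, energy, phone_durations_s):
    """Summarize frame-level tracks into utterance-wise prosody features.

    f0_hz: per-frame pitch in Hz, 0.0 for unvoiced frames (assumed convention).
    energy: per-frame energy values on any consistent scale.
    phone_durations_s: per-phone durations in seconds.
    """
    # Pitch statistics use voiced frames only, on a log scale (assumption).
    log_f0 = sorted(math.log(f) for f in f0_hz if f > 0.0)
    if not log_f0:
        raise ValueError("no voiced frames in utterance")

    def percentile(sorted_vals, q):
        # Nearest-index percentile over a sorted list; q is in [0, 1].
        idx = round(q * (len(sorted_vals) - 1))
        return sorted_vals[idx]

    return {
        "pitch": mean(log_f0),
        # Hypothetical definition: spread between 5th and 95th percentiles.
        "pitch_range": percentile(log_f0, 0.95) - percentile(log_f0, 0.05),
        "duration": mean(phone_durations_s),
        "energy": mean(energy),
    }


features = utterance_prosody(
    f0_hz=[0.0, 100.0, 200.0, 0.0],
    energy=[1.0, 2.0, 3.0, 2.0],
    phone_durations_s=[0.1, 0.3],
)
```

In a setup like the one the abstract describes, values of this kind would serve as conditioning inputs during training, and prosody control at synthesis time would amount to supplying modified values for the same text.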