Publication | Closed Access
A corpus-based approach to expressive speech synthesis.
Citations: 60
References: 5
Year: 2004
Speech Corpus, Communication, Corpus Linguistics, Speech Recognition, Natural Language Processing, Computational Linguistics, Speech Interface, Prosody (Film Studies), Corpus Analysis, Language Studies, Health Sciences, Speech Models, Speech Synthesis, Speech Output, Text-to-speech, Speech Communication, Speech Technology, Voice, Speech Acoustics, Style Channel, Speech Processing, Paralinguistics, Good News, Linguistics, Expressive Synthetic Speech
Human speech comprises a lexical channel and a stylistic channel, each conveying information, yet current TTS systems provide only a single fixed style, falling short of human expressiveness. This paper introduces the IBM Expressive TTS Engine, designed to add a style channel by offering five distinct speaking styles. The engine generates neutral declarative, good‑news, bad‑news, question, and contrastive‑emphasis styles, enriches them with paralinguistic events such as sighs, breaths, and filled pauses, and allows users to specify expression via SSML extensions. Perceptual tests demonstrate significant differences between expressive and neutral synthetic speech across all five styles.
Human speech communication can be thought of as comprising two channels – the words themselves, and the style in which they are spoken. Each of these channels carries information. Today's most advanced text-to-speech (TTS) systems such as [1], [2], [3], [4] fall far short of human speech because they offer only a single, fixed style of delivery, independent of the message. In this paper, we describe the IBM Expressive TTS Engine, which is able to add another channel by offering five speaking styles. These are: neutral declarative, conveying good news, conveying bad news, asking a question, and showing contrastive emphasis. In addition to generating speech in these five styles, our TTS system is also able to generate paralinguistic events such as sighs, breaths, and filled pauses, which further enrich the style channel. We describe our methods for generating and evaluating expressive synthetic speech and paralinguistic effects. We show significant perceptual differences between expressive and neutral synthetic speech for each of our speaking styles. In addition, we describe how users can easily communicate the desired expression to the TTS engine through our extensions [5] of the Speech Synthesis Markup Language (SSML) [6].