Publication | Closed Access
Trainable videorealistic speech animation
Citations: 56
References: 43
Year: 2004
Unknown Venue
Artificial Intelligence, Engineering, Machine Learning, Speech Recognition, Natural Language Processing, Visual Speech Module, Real-time Language, Health Sciences, Video Synthesizer, Speech Animation Module, Speech Synthesis, Speech Output, Synthesized Utterance, Text-to-speech, Speech Communication, Speech Technology, Speech Processing, Speech Perception, Speech Interface
The paper introduces a generative speech animation module that combines a small set of mouth image prototypes with a trajectory synthesis technique to produce novel mouth configurations from phonetically aligned audio, whether real or produced by a text-to-speech system. A subject is recorded speaking a predetermined corpus, a visual speech module is then trained automatically to synthesize mouth movements for utterances absent from the recording, and the synthesized mouth is composited onto a background sequence with natural head and eye motion. The result is videorealistic: it looks like a genuine video recording of the subject.
We describe how to create a generative speech animation module using machine learning techniques. A human subject is first recorded with a video camera while uttering a predetermined speech corpus. After the corpus is processed automatically, a visual speech module is learned from the data; this module can synthesize the subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence that contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned.
The two key contributions of this paper are 1) a variant of the multidimensional morphable model (MMM) to synthesize new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is automatically trained from the recorded video corpus and can synthesize trajectories in MMM space corresponding to any desired utterance.
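The abstract does not state the regularization objective, so the following is only a minimal sketch of the general idea behind regularized trajectory synthesis, not the paper's formulation. It assumes a single scalar MMM parameter, a per-frame target value and precision (inverse variance) derived from the phonetic alignment, and a first-difference smoothness penalty weighted by `lam`. The function name `synthesize_trajectory` and all of its parameters are hypothetical.

```python
from typing import List


def synthesize_trajectory(targets: List[float],
                          precisions: List[float],
                          lam: float) -> List[float]:
    """Return the y that minimizes
        sum_t precisions[t] * (y[t] - targets[t])**2
        + lam * sum_t (y[t+1] - y[t])**2
    (an assumed objective, chosen for illustration only).
    """
    n = len(targets)
    if n != len(precisions):
        raise ValueError("targets and precisions must have equal length")
    if lam < 0 or any(p <= 0 for p in precisions):
        raise ValueError("lam must be >= 0 and all precisions > 0")
    if n == 0:
        return []

    # Setting the gradient to zero gives a symmetric tridiagonal system
    # A y = b with constant off-diagonal entries equal to -lam.
    off = -lam
    diag = []
    rhs = []
    for t in range(n):
        neighbours = (1 if t > 0 else 0) + (1 if t < n - 1 else 0)
        diag.append(precisions[t] + lam * neighbours)
        rhs.append(precisions[t] * targets[t])

    # Thomas algorithm: forward elimination.
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = off / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for t in range(1, n):
        denom = diag[t] - off * c_prime[t - 1]
        c_prime[t] = off / denom
        d_prime[t] = (rhs[t] - off * d_prime[t - 1]) / denom

    # Thomas algorithm: back substitution.
    y = [0.0] * n
    y[-1] = d_prime[-1]
    for t in range(n - 2, -1, -1):
        y[t] = d_prime[t] - c_prime[t] * y[t + 1]
    return y


if __name__ == "__main__":
    # A step between two hypothetical phoneme targets, smoothed with lam = 1.
    print(synthesize_trajectory([0.0, 0.0, 1.0, 1.0], [1.0] * 4, lam=1.0))
```

With unit precisions and `lam = 1.0`, the step targets `[0, 0, 1, 1]` are smoothed to approximately `[1/7, 2/7, 5/7, 6/7]`. Setting `lam = 0` reproduces the targets exactly, and larger values flatten the transition further. A real system would solve this jointly over all MMM dimensions and learn the targets, precisions, and smoothing weight from the recorded corpus.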