Publication | Closed Access
Multimodal Inputs Driven Talking Face Generation With Spatial–Temporal Dependency
Citations: 60
References: 45
Year: 2020
Engineering, Multimodal Learning, Speech Recognition, Multimodal Llm, Arbitrary Speech Clip, Automatic Recognition, Health Sciences, Multimodal Signal Processing, Accurate Lip Synchronization, Deep Learning, Spatial–temporal Dependency, Speech Communication, Computer Vision, Speech Technology, Face Video, Facial Animation, Speech Acoustics, Speech Processing, Speech Input, Speech Perception, Linguistics
Given an arbitrary speech clip or text as input, the proposed work aims to generate a talking face video with accurate lip synchronization. Existing works have three main limitations. (1) Single-modal learning is adopted, with either audio or text as input, so the complementarity of multimodal inputs is lost. (2) Each frame is generated independently, which ignores the temporal dependency between consecutive frames. (3) Each face image is generated by a traditional convolutional neural network (CNN) with a local receptive field, which cannot effectively capture the spatial dependency within the internal representations of face images. To overcome these problems, we decompose the talking face generation task into two steps: mouth landmark prediction and video synthesis. First, a multimodal learning method is proposed to generate accurate mouth landmarks from multimodal inputs (both text and audio). Second, a network named Face2Vid is proposed to generate video frames conditioned on the predicted mouth landmarks. In Face2Vid, optical flow is employed to model the temporal dependency between frames, while a self-attention mechanism is introduced to model the spatial dependency across image regions. Extensive experiments demonstrate that our approach can generate photo-realistic video frames including the background, and that it achieves more accurate synchronization of lip movements and smoother transitions of facial movements.
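The abstract does not specify how the self-attention in Face2Vid is formulated. As a rough illustration of how self-attention can relate distant image regions (in contrast to a convolution's local receptive field), the sketch below assumes plain scaled dot-product attention over the flattened spatial positions of a feature map. The function name `spatial_self_attention` and the projection matrices `w_query`, `w_key`, `w_value` are hypothetical and are not taken from the paper.

```python
import numpy as np


def spatial_self_attention(features, w_query, w_key, w_value):
    """Scaled dot-product self-attention over spatial positions.

    Illustrative assumption only, not the paper's Face2Vid layer.
    features: array of shape (H, W, C), one feature vector per location.
    w_query, w_key: arrays of shape (C, D); w_value: array of shape (C, Dv).
    Returns the attended features (H, W, Dv) and the attention
    weights (H*W, H*W).
    """
    height, width, channels = features.shape
    # Flatten the grid so that every location can attend to every other one.
    x = features.reshape(height * width, channels)

    q = x @ w_query
    k = x @ w_key
    v = x @ w_value

    # Pairwise similarity between all locations, scaled by sqrt(D).
    scores = (q @ k.T) / np.sqrt(q.shape[1])
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each output location is a weighted mix of values from all locations.
    attended = weights @ v
    return attended.reshape(height, width, -1), weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((4, 4, 8))    # toy 4x4 feature map
    wq = rng.standard_normal((8, 8))
    wk = rng.standard_normal((8, 8))
    wv = rng.standard_normal((8, 8))
    out, attn = spatial_self_attention(feats, wq, wk, wv)
    print(out.shape, attn.shape)
```

Because every row of the attention matrix spans all `H*W` positions, a mouth-region feature can draw on, for example, jaw or cheek features in a single step, which is the kind of long-range spatial dependency the abstract refers to.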