Publication | Closed Access
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Citations: 118
References: 35
Year: 2023
Unknown Venue
Music, Engineering, Machine Learning, Multimodal Learning, Video Summarization, Video Interpretation, Speech Recognition, Attention Block Bridging, Joint Denoising Process, Video Restoration, Video Synthesis, Video Synthesizer, Video Generation, Multimodal Signal Processing, Computer Science, Video Understanding, Deep Learning, Signal Processing, Computer Vision, Joint Audio, Joint Audio-video Pairs, Speech Processing, Video Hallucination, Diffusion-based Modeling, Arts
MM-Diffusion is a joint audio-video generation framework built on two coupled denoising autoencoders. Unlike single-modal diffusion models, it uses a sequential multi-modal U-Net whose separate audio and video subnets jointly denoise Gaussian noise into aligned audio-video pairs, linked by a random-shift attention block for cross-modal alignment. Experiments show that MM-Diffusion outperforms baselines in unconditional audio-video generation and zero-shot conditional tasks, achieving the best FVD and FAD on the Landscape and AIST++ datasets; Turing tests with 10k votes show a dominant preference for its outputs. Code and pretrained models are available at https://github.com/researchmm/MM-Diffusion.
We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously, moving toward high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process. Two subnets, one for audio and one for video, learn to gradually generate aligned audio-video pairs from Gaussian noise. To ensure semantic consistency across modalities, we propose a novel random-shift based attention block that bridges the two subnets; it enables efficient cross-modal alignment, so that each modality reinforces the fidelity of the other. Extensive experiments show superior results in unconditional audio-video generation and in zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing datasets. Turing tests with 10k votes further demonstrate a dominant preference for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion.
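The abstract does not spell out how the random-shift attention block chooses what to attend to. The following is a minimal sketch of one plausible reading, not the authors' implementation: the audio is assumed to be split into as many segments as there are video frames, and each segment attends only to a short window of frames whose start is offset by a randomly drawn shift. The names `random_shift_window` and `sample_attention_windows`, and the parameters `num_frames`, `window`, and `shift`, are hypothetical and chosen for illustration only.

```python
import random


def random_shift_window(segment_idx, num_frames, window, shift):
    """Return the video frame indices one audio segment attends to.

    Assumption: the window starts at (segment_idx + shift) and wraps
    around the end of the clip, so it always holds `window` frames.
    """
    start = (segment_idx + shift) % num_frames
    return [(start + k) % num_frames for k in range(window)]


def sample_attention_windows(num_frames, window, rng):
    """Draw one shift and build the window for every audio segment.

    Assumption: a single shift in [0, num_frames - window] is shared by
    all segments within one call (for example, one denoising step).
    """
    shift = rng.randint(0, num_frames - window)  # both bounds inclusive
    windows = [
        random_shift_window(i, num_frames, window, shift)
        for i in range(num_frames)
    ]
    return shift, windows


def attention_pair_counts(num_frames, window):
    """Compare segment-frame pairs: full attention vs. windowed attention."""
    return num_frames * num_frames, num_frames * window


if __name__ == "__main__":
    rng = random.Random(0)
    shift, windows = sample_attention_windows(num_frames=16, window=4, rng=rng)
    print("shift:", shift)
    print("segment 0 attends to frames:", windows[0])
    print("full vs. windowed pairs:", attention_pair_counts(16, 4))
```

Under these assumptions, restricting each segment to a window reduces the number of segment-frame pairs from `num_frames * num_frames` to `num_frames * window`, which is one way to read the "efficient cross-modal alignment" claim, while redrawing the shift lets different frame neighborhoods be covered over time.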