Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

TLDR

Despite advances in image diffusion, generating photorealistic, temporally coherent video remains difficult because large-scale video datasets are scarce and training video models is computationally more expensive. This work finetunes a pretrained image diffusion model on video data and finds that naively extending the image noise prior to video is sub-optimal, so it uses a carefully designed video noise prior instead. The resulting model, Preserve Your Own Correlation (PYoCo), achieves state-of-the-art zero-shot text-to-video results on UCF-101 and MSR-VTT, and state-of-the-art generation quality on the small-scale UCF-101 benchmark with a 10× smaller model and significantly less computation than prior work. Project page: https://research.nvidia.com/labs/dir/pyoco/.

Abstract

Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own COrrelation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a 10× smaller model using significantly less computation than the prior art. The project page is available at https://research.nvidia.com/labs/dir/pyoco/.
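
The abstract contrasts a naive extension of the image noise prior with a video noise prior that preserves correlation across frames, but it does not spell out the design. The sketch below is only an illustration of that contrast under stated assumptions: `independent_noise` draws i.i.d. Gaussian noise per frame (the naive extension), while `mixed_noise` adds a noise component shared by all frames so that frames are correlated yet each element keeps unit variance. The function names, the mixing parameter `alpha`, and the choice of a shared-plus-independent decomposition are assumptions made here for illustration, not details confirmed by the text above.

```python
import numpy as np


def independent_noise(num_frames, frame_shape, rng):
    """Naive extension: every frame gets its own i.i.d. N(0, 1) noise."""
    return rng.standard_normal((num_frames, *frame_shape))


def mixed_noise(num_frames, frame_shape, alpha, rng):
    """Correlated noise (illustrative assumption, not the paper's stated design).

    Each frame is the sum of one noise map shared by all frames and a
    per-frame independent map. The variances alpha^2 / (1 + alpha^2) and
    1 / (1 + alpha^2) sum to 1, so the marginal distribution of every
    element is still N(0, 1).
    """
    shared_std = np.sqrt(alpha**2 / (1.0 + alpha**2))
    indep_std = np.sqrt(1.0 / (1.0 + alpha**2))
    shared = shared_std * rng.standard_normal((1, *frame_shape))
    indep = indep_std * rng.standard_normal((num_frames, *frame_shape))
    return shared + indep  # the shared map broadcasts over the frame axis


def frame_correlation(noise):
    """Pearson correlation between the first two frames of a noise tensor."""
    return float(np.corrcoef(noise[0].ravel(), noise[1].ravel())[0, 1])


rng = np.random.default_rng(0)
naive = independent_noise(8, (4, 64, 64), rng)
mixed = mixed_noise(8, (4, 64, 64), alpha=2.0, rng=rng)
print("naive frame correlation:", frame_correlation(naive))
print("mixed frame correlation:", frame_correlation(mixed))
```

Under this construction the expected correlation between any two frames is alpha^2 / (1 + alpha^2), so `alpha = 0` recovers the independent case and larger `alpha` makes the frames share more of their noise.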
