Concepedia

TLDR

Despite advances in image diffusion models, generating photorealistic, temporally coherent video remains difficult because billion-scale video datasets are hard to collect and video diffusion models are much more expensive to train than image models. The study finetunes a pretrained image diffusion model on video data as a practical route to video synthesis. It finds that naively extending the image noise prior to video gives sub-optimal results, and instead uses a carefully designed video noise prior. The resulting model, Preserve Your Own Correlation (PYoCo), achieves state-of-the-art zero-shot text-to-video results on UCF-101 and MSR-VTT, and state-of-the-art video generation quality on the small-scale UCF-101 benchmark with a 10× smaller model and significantly less computation than prior work. Project page: https://research.nvidia.com/labs/dir/pyoco/.

Abstract

Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than training its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to a video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own COrrelation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a 10× smaller model using significantly less computation than the prior art. The project page is available at https://research.nvidia.com/labs/dir/pyoco/.
