ModelScope Text-to-Video Technical Report

TLDR

This paper introduces ModelScopeT2V, a text‑to‑video synthesis model derived from Stable Diffusion. ModelScopeT2V uses spatio‑temporal blocks and a 1.7‑billion‑parameter architecture comprising VQGAN, a text encoder, and a denoising UNet, with 0.5 billion parameters devoted to temporal modeling, enabling consistent frame generation and flexible frame counts. It outperforms state‑of‑the‑art baselines on three evaluation metrics. Code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.

Abstract

This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, making it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
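
To illustrate why a factorized spatio-temporal block can accept varying frame numbers, the following is a minimal toy sketch, not the model's actual implementation. It assumes a block that applies a spatial operation to each frame independently and then a temporal operation across frames at each pixel position. The names `spatial_mix`, `temporal_mix`, and `spatio_temporal_block` are hypothetical, and the simple averaging stands in for learned layers that the abstract does not specify.

```python
from typing import List

Frame = List[float]   # one frame, flattened to a list of pixel values
Video = List[Frame]   # F frames; an image is treated as a video with F == 1


def spatial_mix(frame: Frame) -> Frame:
    """Toy spatial op (placeholder): blend each pixel with its frame's mean."""
    mean = sum(frame) / len(frame)
    return [0.5 * p + 0.5 * mean for p in frame]


def temporal_mix(video: Video) -> Video:
    """Toy temporal op (placeholder): blend each pixel with its mean over frames."""
    num_frames = len(video)
    num_pixels = len(video[0])
    means = [sum(f[i] for f in video) / num_frames for i in range(num_pixels)]
    return [[0.5 * f[i] + 0.5 * means[i] for i in range(num_pixels)] for f in video]


def spatio_temporal_block(video: Video) -> Video:
    """Apply the spatial op per frame, then mix information across frames."""
    return temporal_mix([spatial_mix(f) for f in video])


# The same block handles a single image (F == 1) and a two-frame clip (F == 2).
image = [[0.0, 2.0, 4.0]]
clip = [[0.0, 2.0, 4.0], [4.0, 2.0, 0.0]]
print(spatio_temporal_block(image))
print(spatio_temporal_block(clip))
```

In this sketch, the temporal step leaves a single-frame input unchanged, so the block reduces to a purely spatial operation on images. This is one plausible reading of how the same network could be trained on both image-text and video-text data.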