Publication | Open Access
ModelScope Text-to-Video Technical Report
Year: 2023 | Citations: 46 | References: 0
Keywords: Denoising UNet, Image Analysis, Machine Learning, Engineering, Video Summarization, Video Hallucination, Video Interpretation, Video Understanding, Deep Learning, Content Analysis, Text-to-video Synthesis Model, Stable Diffusion, Video Synthesis, Video Article, Computer Vision, Data Modeling, Video Synthesizer
This paper introduces ModelScopeT2V, a text-to-video synthesis model derived from Stable Diffusion. The model uses spatio-temporal blocks for consistent frame generation and handles varying frame counts. Its 1.7-billion-parameter architecture comprises a VQGAN, a text encoder, and a denoising UNet, with 0.5 billion parameters devoted to temporal modeling. The authors report that it outperforms state-of-the-art baselines on three evaluation metrics. Code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, making it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
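To make the idea of a spatio-temporal block that adapts to varying frame numbers more concrete, the following is a minimal toy sketch in plain Python. It is not the ModelScopeT2V implementation: it only assumes a factorized design in which a spatial step treats each frame independently and a temporal step mixes each spatial position across frames. The names `spatial_op`, `temporal_op`, and `spatio_temporal_block`, as well as the specific operations (mean-centring and a 3-tap moving average), are hypothetical stand-ins chosen for illustration.

```python
from typing import List

Frame = List[float]   # one flattened frame (a toy stand-in for a latent map)
Video = List[Frame]   # F frames, each holding the same number of values


def spatial_op(frame: Frame) -> Frame:
    """Toy per-frame operation (assumption): centre the frame on its own mean."""
    mean = sum(frame) / len(frame)
    return [v - mean for v in frame]


def temporal_op(series: List[float]) -> List[float]:
    """Toy cross-frame operation (assumption): 3-tap moving average.

    Edges are handled by replicating the first and last value, so a
    series of length 1 (a single image) passes through unchanged.
    """
    n = len(series)
    out = []
    for t in range(n):
        prev = series[max(t - 1, 0)]
        nxt = series[min(t + 1, n - 1)]
        out.append((prev + series[t] + nxt) / 3.0)
    return out


def spatio_temporal_block(video: Video) -> Video:
    """Apply the spatial step per frame, then the temporal step per position."""
    # Spatial step: every frame is processed independently of the others.
    spatial = [spatial_op(frame) for frame in video]
    # Temporal step: each spatial position is processed along the frame axis.
    num_positions = len(spatial[0])
    columns = [
        temporal_op([frame[p] for frame in spatial])
        for p in range(num_positions)
    ]
    # Transpose back to frame-major layout: [frame][position].
    return [
        [columns[p][t] for p in range(num_positions)]
        for t in range(len(video))
    ]


# A single image is simply a video with one frame.
image_as_video = [[1.0, 2.0, 3.0]]
# A three-frame clip with a brief flicker in the middle frame.
clip = [[0.0, 0.0], [3.0, -3.0], [0.0, 0.0]]

image_out = spatio_temporal_block(image_as_video)
clip_out = spatio_temporal_block(clip)
```

In this sketch the same function accepts one frame or many, which mirrors in a simplified way why a factorized block can be used with both image-text and video-text data: with a single frame the temporal step has nothing to mix, and with several frames it smooths values across time.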