Publication | Open Access
Video Generation From Text
Citations: 39 · References: 30 · Year: 2018
Engineering, Machine Learning, Video Summarization, Video Adaptation, Natural Language Processing, Dynamic Information, Image Analysis, Video Synthesizer, Synthetic Image Generation, Video Generation, Generative Models, Human Image Synthesis, Available Online Videos, Deep Learning, Computer Vision, Generative Adversarial Network, Video Hallucination, Generative AI, Inception Score
Generating videos from text remains a major challenge for current generative models. The study trains a conditional generative model to extract static and dynamic information from text for video generation. The authors propose a hybrid VAE–GAN framework that uses a static “gist” to sketch background and layout, transforms the input text into an image filter to capture dynamic features, and automatically builds a matched text–video corpus from online videos for training. Experimental results show that the framework produces plausible, diverse short videos that accurately reflect the input text, and that it outperforms baseline models adapted from text-to-image generation, as evaluated both visually and by inception score.
Generating videos from text has proven to be a significant challenge for existing generative models. We tackle this problem by training a conditional generative model to extract both static and dynamic information from text. This is manifested in a hybrid framework, employing a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). The static features, called "gist," are used to sketch text-conditioned background color and object layout structure. Dynamic features are considered by transforming input text into an image filter. To obtain a large amount of data for training the deep-learning model, we develop a method to automatically create a matched text-video corpus from publicly available online videos. Experimental results show that the proposed framework generates plausible and diverse short-duration smooth videos, while accurately reflecting the input text information. It significantly outperforms baseline models that directly adapt text-to-image generation procedures to produce videos. Performance is evaluated both visually and by adapting the inception score used to evaluate image generation in GANs.
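The abstract's "Text2Filter" idea, transforming a text embedding into an image filter that is applied to the static gist, can be illustrated with a toy sketch. This is not the paper's CNN-based architecture: the sentence encoder, the linear projection to kernel weights, and the per-frame motion noise below are all simplified placeholder assumptions, using plain NumPy in place of a trained deep model.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8   # toy text-embedding size (assumption)
KERNEL = 3      # spatial size of the text-conditioned filter
H = W = 16      # gist resolution (the paper uses larger frames)
T = 4           # number of frames to generate

def encode_text(tokens):
    """Toy stand-in for a sentence encoder: mean of random word vectors."""
    vocab = {}
    vecs = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = rng.normal(size=EMBED_DIM)
        vecs.append(vocab[tok])
    return np.mean(vecs, axis=0)

def text_to_filter(text_vec, proj):
    """Text2Filter idea: map the text embedding to a conv kernel."""
    k = (proj @ text_vec).reshape(KERNEL, KERNEL)
    return k / (np.abs(k).sum() + 1e-8)  # normalize for stability

def conv2d_same(img, kernel):
    """'Same' 2-D convolution with zero padding (single channel)."""
    p = KERNEL // 2
    padded = np.pad(img, p)
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + KERNEL, j:j + KERNEL] * kernel)
    return out

# Static "gist": background color / layout sketch (noise here, for illustration;
# in the paper it comes from a conditional VAE).
gist = rng.normal(size=(H, W))
proj = rng.normal(size=(KERNEL * KERNEL, EMBED_DIM))  # learned in the real model

text_vec = encode_text("a man swims in the pool".split())
kernel = text_to_filter(text_vec, proj)

# Dynamic part: each frame perturbs the text-filtered gist with motion noise,
# standing in for the GAN generator's per-frame latent input.
frames = np.stack([conv2d_same(gist, kernel) + 0.1 * rng.normal(size=(H, W))
                   for _ in range(T)])
print(frames.shape)  # (4, 16, 16)
```

The key design point carried over from the abstract is the separation of concerns: the gist fixes text-conditioned static structure once, while the text-derived filter and per-frame noise inject the dynamic content.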