VideoFlow: A Flow-Based Generative Model for Video

TLDR

Generative models that predict future events can capture complex real‑world phenomena, yet video prediction remains difficult because future frames are highly uncertain and existing probabilistic models are either computationally expensive or do not directly optimize likelihood. The authors propose a normalizing‑flow based video prediction model that directly optimizes data likelihood and produces high‑quality stochastic predictions. They model latent‑space dynamics with flow‑based generative models, demonstrating that normalizing flows can learn competitive video representations. This is the first multi‑frame video prediction system using normalizing flows, and the experiments show that flow‑based generative models are a viable and competitive approach to video modeling.

Abstract

Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. In particular, learning predictive models of videos offers an especially appealing mechanism to enable a rich understanding of the physical world: videos of real-world interactions are plentiful and readily available, and a model that can predict future video frames can not only capture useful representations of the world, but can be useful in its own right, for problems such as model-based robotic control. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally (as in the case of pixel-level autoregressive models), or do not directly optimize the likelihood of the data. In this work, we propose a model for video prediction based on normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows. We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modeling of video.

References

Page 1

	Year	Citations

Page 1