Dynamic Image Networks for Action Recognition

TLDR

The authors propose dynamic images, a compact video representation that enables convolutional neural networks to process video data, and introduce an efficient approximate rank‑pooling operator to accelerate this process. Dynamic images are generated by applying rank pooling to raw video frames, producing a single RGB image per clip, and the authors extend this with an approximate rank‑pooling CNN layer that generalizes to dynamic feature maps. Using dynamic images and the approximate rank‑pooling layer, the authors achieve state‑of‑the‑art action‑recognition performance on standard benchmarks while allowing existing CNN models to be fine‑tuned directly on video data.

Abstract

We introduce the concept of the dynamic image, a novel compact representation of videos that is useful for video analysis, especially when convolutional neural networks (CNNs) are used. The dynamic image is based on the rank pooling concept and is obtained through the parameters of a ranking machine that encodes the temporal evolution of the frames of the video. Dynamic images are obtained by directly applying rank pooling to the raw image pixels of a video, producing a single RGB image per video. This idea is simple but powerful, as it enables the use of existing CNN models directly on video data with fine-tuning. We present an efficient and effective approximate rank pooling operator that is orders of magnitude faster than rank pooling. Our new approximate rank pooling CNN layer allows us to generalize dynamic images to dynamic feature maps, and we demonstrate the power of our new representations on standard benchmarks in action recognition, achieving state-of-the-art performance.
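To make the idea of collapsing a clip into one image concrete, the following is a minimal sketch of an approximate rank-pooling step. The abstract does not state the operator's coefficients, so the weights here are an assumption: they come from taking a single gradient step on a pairwise ranking objective starting from zero, which gives a direction proportional to the sum over all pairs q > t of (V_q − V_t), i.e. a weighted sum of frames with weights α_t = 2t − T − 1. The function names and the flat-list frame layout are hypothetical and chosen only to keep the example dependency-free; the final rescaling to 0..255 is likewise an illustrative choice so that the result can be viewed or passed to an image CNN.

```python
def approx_rank_pool_weights(num_frames):
    """Weights alpha_t = 2t - T - 1 for t = 1..T (assumed simplified form)."""
    T = num_frames
    return [2 * t - T - 1 for t in range(1, T + 1)]


def dynamic_image(frames):
    """Collapse a clip into one image by a weighted sum over time.

    frames: list of T frames, each a flat list of pixel values of equal
    length (for RGB, flatten height x width x 3 into one list).
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_pixels = len(frames[0])
    weights = approx_rank_pool_weights(len(frames))
    pooled = [0.0] * num_pixels
    for weight, frame in zip(weights, frames):
        if len(frame) != num_pixels:
            raise ValueError("all frames must have the same size")
        for i, value in enumerate(frame):
            pooled[i] += weight * value
    return pooled


def rescale_to_uint8(image):
    """Min-max rescale to integers in 0..255 (illustrative choice)."""
    lo, hi = min(image), max(image)
    if hi == lo:
        return [0] * len(image)  # constant image: no temporal change
    return [round(255 * (v - lo) / (hi - lo)) for v in image]


if __name__ == "__main__":
    # Toy clip: 3 frames of 2 pixels; the first pixel brightens over time.
    clip = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0]]
    print(dynamic_image(clip))
```

Because the weights sum to zero, pixels that do not change over the clip contribute nothing, while pixels that brighten or darken over time produce positive or negative responses. Later frames receive larger weights than earlier ones, which is how the single output image retains information about temporal order.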
