Publication | Closed Access
STM: SpatioTemporal and Motion Encoding for Action Recognition
Citations: 447 | References: 31 | Year: 2019 | Venue: Unknown
Keywords: Convolutional Neural Network, Engineering, Machine Learning, Action Recognition (Computer Vision), Video Interpretation, Image Analysis, Data Science, Pattern Recognition, STM Block, Video Transformer, Machine Vision, Computer Science, Video Understanding, Deep Learning, Computer Vision, Motion Encoding, Motion Features, Video Action Recognition, Video Hallucination, Activity Recognition
Spatiotemporal and motion features are complementary and crucial for video action recognition, with state-of-the-art methods using a separate 3D CNN stream and an optical-flow stream. This work aims to encode both feature types efficiently within a unified 2D framework. We introduce an STM block, comprising a Channel-wise SpatioTemporal Module and a Channel-wise Motion Module, that replaces residual blocks in ResNet to build a lightweight STM network. Experiments show the STM network surpasses the state of the art on temporal-focused datasets (Something-Something v1 & v2, Jester) and scene-focused datasets (Kinetics-400, UCF-101, HMDB-51) by jointly encoding spatiotemporal and motion features.
Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network, introducing only very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51), thanks to encoding spatiotemporal and motion features together.
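For readers who want a concrete picture of what "a CSTM plus a CMM inside a residual block" might look like, below is a minimal PyTorch-style sketch. It assumes a (batch, frames, channels, height, width) feature layout; the specific layer choices (a channel-wise temporal convolution for CSTM, adjacent-frame feature differencing for CMM) and the module names `CSTM`, `CMM`, and `STMBlock` are illustrative assumptions based only on this abstract, not the paper's exact design.

```python
# Illustrative sketch only: module internals are assumptions, not the paper's exact layers.
import torch
import torch.nn as nn


class CSTM(nn.Module):
    """Channel-wise SpatioTemporal Module (assumed form): a depthwise temporal
    convolution applied per channel across the T frames of a clip."""
    def __init__(self, channels: int, t_kernel: int = 3):
        super().__init__()
        # groups=channels makes the temporal convolution channel-wise.
        self.temporal = nn.Conv1d(channels, channels, t_kernel,
                                  padding=t_kernel // 2, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)  # fold space into batch
        y = self.temporal(y)
        return y.reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2)


class CMM(nn.Module):
    """Channel-wise Motion Module (assumed form): approximates motion by
    differencing features of adjacent frames, as a stand-in for optical flow."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t, c, h, w = x.shape
        nxt = self.conv(x[:, 1:].reshape(-1, c, h, w)).reshape(n, t - 1, c, h, w)
        diff = nxt - x[:, :-1]                                   # feature-level "motion"
        return torch.cat([diff, torch.zeros_like(x[:, :1])], 1)  # pad the last frame


class STMBlock(nn.Module):
    """Residual block whose CSTM and CMM branches are summed, echoing the
    abstract's idea of replacing ResNet residual blocks with STM blocks."""
    def __init__(self, channels: int):
        super().__init__()
        self.cstm, self.cmm = CSTM(channels), CMM(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.cstm(x) + self.cmm(x)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 64, 28, 28)    # (batch, frames, channels, H, W)
    print(STMBlock(64)(clip).shape)         # torch.Size([2, 8, 64, 28, 28])
```

Because both branches preserve the 2D feature shape, such a block can be dropped into an existing ResNet stage, which is what keeps the extra computation cost limited.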