Publication | Closed Access
A Closer Look at Spatiotemporal Convolutions for Action Recognition
2018 · 3.4K citations · 34 references · venue unknown
Keywords: Engineering, Machine Learning, Convolutional Filters, Video Retrieval, Video Interpretation, Image Analysis, Data Science, Pattern Recognition, Residual Learning, Video Transformer, Dance, Machine Vision, Action Recognition, Closer Look, Video Understanding, Deep Learning, Computer Vision, Video Analysis, Video Hallucination, Activity Recognition
2D CNNs applied to individual video frames have remained solid performers in action recognition. This paper investigates whether 3D CNNs offer accuracy advantages over 2D CNNs for video action recognition within a residual learning framework, comparing several spatiotemporal convolution designs, including full 3D convolutions and factorized 3D filters. The study shows that 3D CNNs outperform 2D CNNs, that factorizing 3D filters into separate spatial and temporal components further improves accuracy, and that the proposed R(2+1)D block achieves state‑of‑the‑art results on Sports‑1M, Kinetics, UCF101, and HMDB51.
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
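The (2+1)D factorization described in the abstract replaces a full t×d×d 3D convolution with a 1×d×d spatial convolution followed by a t×1×1 temporal convolution, with a nonlinearity in between. A minimal PyTorch sketch of such a block is given below; the class name `R2Plus1DConv` and the constructor arguments are illustrative, not the authors' code, though the formula for the intermediate channel count follows the paper's rule of matching the parameter count of the full 3D convolution.

```python
import torch
import torch.nn as nn


class R2Plus1DConv(nn.Module):
    """Sketch of a (2+1)D block: a 3D conv with a t x d x d kernel is
    factorized into a 1 x d x d spatial conv followed by a t x 1 x 1
    temporal conv, with a ReLU between the two (hypothetical naming)."""

    def __init__(self, in_channels, out_channels, kernel_t=3, kernel_s=3,
                 mid_channels=None):
        super().__init__()
        if mid_channels is None:
            # Choose the intermediate width M so the factorized block has
            # roughly the same parameter count as the full 3D convolution:
            # M = floor(t*d^2*N_in*N_out / (d^2*N_in + t*N_out))
            mid_channels = (kernel_t * kernel_s**2 * in_channels * out_channels) // (
                kernel_s**2 * in_channels + kernel_t * out_channels)
        # Spatial convolution: acts on each frame independently.
        self.spatial = nn.Conv3d(
            in_channels, mid_channels,
            kernel_size=(1, kernel_s, kernel_s),
            padding=(0, kernel_s // 2, kernel_s // 2))
        self.relu = nn.ReLU(inplace=True)
        # Temporal convolution: mixes information across frames only.
        self.temporal = nn.Conv3d(
            mid_channels, out_channels,
            kernel_size=(kernel_t, 1, 1),
            padding=(kernel_t // 2, 0, 0))

    def forward(self, x):
        # x has shape (batch, channels, time, height, width).
        return self.temporal(self.relu(self.spatial(x)))
```

The extra ReLU between the spatial and temporal convolutions is one of the advantages the paper attributes to the factorization: it doubles the number of nonlinearities for the same parameter budget, and the factorized block is also easier to optimize than a full 3D convolution.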