Publication | Closed Access
Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition
Citations: 105
References: 52
Year: 2021
Unknown Venue
Engineering, Machine Learning, Action Quality Assessment, Video Interpretation, Image Analysis, Data Science, Pattern Recognition, Deep Analysis, Robot Learning, Video Transformer, Machine Vision, Fair Comparison, Computer Science, Video Understanding, Deep Learning, Computer Vision, Video Action Recognition, Convolutional Neural Networks, Activity Recognition
In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNNs) have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out an in-depth comparative analysis to better understand the differences between these approaches and the progress they have made. To this end, we develop a unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides common ground for fair comparison. We then conduct a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap has been made in efficiency for action recognition, but not in accuracy; b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation ability and transferability. Our code is available at https://github.com/IBM/action-recognition-pytorch.