Publication | Closed Access
TDN: Temporal Difference Networks for Efficient Action Recognition
Citations: 454
References: 41
Year: 2021
Venue: Unknown
Engineering, Machine Learning, Temporal Difference Network, Video Processing, Action Recognition (Computer Vision), Video Interpretation, Image Analysis, Temporal Modeling, Robot Learning, Machine Vision, Computer Science, Video Understanding, Temporal Difference Networks, Deep Learning, Computer Vision, Temporal Difference, Video Analysis, Video Hallucination, Activity Recognition
Temporal modeling remains a challenge for action recognition in videos. This paper proposes the Temporal Difference Network (TDN) to capture multi‑scale temporal information for efficient action recognition. TDN introduces a Temporal Difference Module that applies a temporal difference operator at two levels—frame‑wise for local motion and segment‑wise for global motion—to efficiently model short‑ and long‑term dynamics. TDN achieves state‑of‑the‑art performance on Something‑Something V1 & V2, matches top results on Kinetics‑400, adds only modest computational overhead, and is supported by extensive ablation and visualization studies; the code is publicly available.
Temporal modeling remains challenging for action recognition in videos. To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition. The core of our TDN is an efficient temporal module (TDM) that explicitly leverages a temporal difference operator, and we systematically assess its effect on short-term and long-term motion modeling. To fully capture temporal information over the entire video, our TDN is established with a two-level difference modeling paradigm. Specifically, for local motion modeling, temporal difference over consecutive frames is used to supply 2D CNNs with finer motion patterns, while for global motion modeling, temporal difference across segments is incorporated to capture long-range structure for motion feature excitation. TDN provides a simple and principled temporal modeling framework and can be instantiated with existing CNNs at a small extra computational cost. Our TDN sets a new state of the art on the Something-Something V1 & V2 datasets and is on par with the best performance on the Kinetics-400 dataset. In addition, we conduct in-depth ablation studies and present visualization results of our TDN, hoping to provide insightful analysis of temporal difference modeling. We release the code at https://github.com/MCG-NJU/TDN.
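The two-level difference idea described in the abstract can be illustrated with a minimal sketch. This is not the released TDN implementation. The function names, the flat-list frame representation, and the use of a per-segment mean as a stand-in for CNN segment features are all assumptions made here for illustration only.

```python
from typing import List

# Hypothetical representation: a frame (or feature map) flattened to a list of floats.
Frame = List[float]


def difference(a: Frame, b: Frame) -> Frame:
    """Element-wise temporal difference b - a."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    return [y - x for x, y in zip(a, b)]


def split_into_segments(frames: List[Frame], num_segments: int) -> List[List[Frame]]:
    """Split a video into equal-length segments (assumes an even split)."""
    if num_segments <= 0 or len(frames) % num_segments != 0:
        raise ValueError("frame count must be divisible by num_segments")
    size = len(frames) // num_segments
    return [frames[i * size:(i + 1) * size] for i in range(num_segments)]


def local_motion(segment: List[Frame]) -> List[Frame]:
    """Local level: differences between consecutive frames inside one segment."""
    return [difference(segment[i], segment[i + 1]) for i in range(len(segment) - 1)]


def segment_summary(segment: List[Frame]) -> Frame:
    """Assumed stand-in for a segment feature: the element-wise mean of its frames."""
    n = len(segment)
    return [sum(values) / n for values in zip(*segment)]


def global_motion(segments: List[List[Frame]]) -> List[Frame]:
    """Global level: differences between summaries of adjacent segments."""
    summaries = [segment_summary(s) for s in segments]
    return [difference(summaries[i], summaries[i + 1]) for i in range(len(summaries) - 1)]


# Toy video: four frames with two "pixels" each, split into two segments.
video = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [4.0, 1.0]]
segments = split_into_segments(video, 2)
local = [local_motion(s) for s in segments]
global_diffs = global_motion(segments)
```

In this toy setup the local differences describe short-term change within each segment, while the global differences describe how the segments change relative to one another. In the paper, the cross-segment difference is used for motion feature excitation; that step is not modeled here.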