Publication | Closed Access
Temporal Convolutional Networks for Action Segmentation and Detection
1.9K Citations · 34 References · Year: 2017
Venue: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Topics: Machine Vision · Machine Learning · Image Analysis · Engineering · Pattern Recognition · Temporal Convolutional Networks · Video Summarization · Temporal Classifier · Video Hallucination · Video Transformer · Video Understanding · Robot Learning · Deep Learning · Activity Recognition · Temporal Convolutions · Video Interpretation · Computer Vision
The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
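For intuition, below is a minimal PyTorch-style sketch of the two architectures the abstract names: an Encoder-Decoder TCN that pools and upsamples along time, and a Dilated TCN whose exponentially growing dilation widens the receptive field without pooling. Hidden sizes, kernel widths, layer counts, and the plain ReLU activations are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EncoderDecoderTCN(nn.Module):
    """Sketch of an Encoder-Decoder TCN: temporal pooling in the encoder and
    upsampling in the decoder capture long-range patterns. Input length
    should be divisible by 4 so the two pool/upsample stages round-trip."""
    def __init__(self, in_dim, n_classes, hidden=64, kernel=25):
        super().__init__()
        pad = kernel // 2  # 'same' padding preserves the temporal length
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel, padding=pad), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(hidden, hidden, kernel, padding=pad), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(hidden, hidden, kernel, padding=pad), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv1d(hidden, hidden, kernel, padding=pad), nn.ReLU(),
        )
        self.classify = nn.Conv1d(hidden, n_classes, 1)  # per-frame logits

    def forward(self, x):  # x: (batch, feature_dim, frames)
        return self.classify(self.decoder(self.encoder(x)))

class DilatedTCN(nn.Module):
    """Sketch of a Dilated TCN: stacked convolutions with exponentially
    increasing dilation model long-range dependencies without pooling."""
    def __init__(self, in_dim, n_classes, hidden=64, n_layers=4, kernel=3):
        super().__init__()
        layers = [nn.Conv1d(in_dim, hidden, 1)]
        for i in range(n_layers):
            d = 2 ** i  # dilation 1, 2, 4, 8, ...
            layers += [nn.Conv1d(hidden, hidden, kernel,
                                 padding=d * (kernel - 1) // 2,
                                 dilation=d), nn.ReLU()]
        layers.append(nn.Conv1d(hidden, n_classes, 1))  # per-frame logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, feature_dim, frames)
        return self.net(x)

x = torch.randn(2, 128, 400)                 # 400 frames, 128-dim features
print(EncoderDecoderTCN(128, 10)(x).shape)   # torch.Size([2, 10, 400])
print(DilatedTCN(128, 10)(x).shape)          # torch.Size([2, 10, 400])
```

Both variants emit one class-score vector per frame, which is what makes them usable for segmentation (label every frame) as well as detection; because every layer is a convolution over the full sequence, training parallelizes across time rather than unrolling step by step as an LSTM does, which is the source of the training-speed advantage the abstract claims.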