Publication | Closed Access
Deep Sequential Context Networks for Action Prediction
173
Citations
34
References
2017
Year
Unknown Venue
Machine VisionMachine LearningVideo AnalysisData ScienceAction Recognition (Movement Science)Action PredictionIncomplete Action ExecutionsAction VideosAction Recognition (Computer Vision)EngineeringAction Model LearningVideo HallucinationComputer ScienceVideo UnderstandingRobot LearningDeep LearningVideo InterpretationComputer Vision
Action prediction requires predicting labels from partially observed videos, unlike post‑hoc recognition. The paper proposes efficient deep networks for predicting actions from temporally incomplete, partially observed videos. The method enriches partial‑video features with sequential context, reconstructs missing information by learning from fully observed videos, orders temporal information, uses label cues to separate categories, and introduces a new efficient training formulation. Experiments on UCF101, Sports‑1M, and BIT show the approach outperforms state‑of‑the‑art methods, is up to 300× faster, and can predict many actions from only the first 10% of a video.
This paper proposes efficient and powerful deep networks for action prediction from partially observed videos containing temporally incomplete action executions. Different from after-the-fact action recognition, action prediction task requires action labels to be predicted from these partially observed videos. Our approach exploits abundant sequential context information to enrich the feature representations of partial videos. We reconstruct missing information in the features extracted from partial videos by learning from fully observed action videos. The amount of the information is temporally ordered for the purpose of modeling temporal orderings of action segments. Label information is also used to better separate the learned features of different categories. We develop a new learning formulation that enables efficient model training. Extensive experimental results on UCF101, Sports-1M and BIT datasets demonstrate that our approach remarkably outperforms state-of-the-art methods, and is up to 300× faster than these methods. Results also show that actions differ in their prediction characteristics, some actions can be correctly predicted even though only the beginning 10% portion of videos is observed.
| Year | Citations | |
|---|---|---|
Page 1
Page 1