Publication | Closed Access
Action Recognition in Still Images With Minimum Annotation Efforts
Citations: 99
References: 48
Year: 2016
Artificial Intelligence, Engineering, Machine Learning, Human Pose Estimation, Human Poses, Video Interpretation, Human-object Interaction, Image Analysis, Data Science, Pattern Recognition, Robot Learning, Minimum Annotation Efforts, Video Transformer, Only Action Annotations, Health Sciences, Machine Vision, Computer Science, Video Understanding, Deep Learning, Computer Vision, Activity Recognition
Still image human action recognition usually depends on human bounding boxes to capture human‑object interactions, which limits its practical use. This study seeks to remove the need for such bounding boxes during training and testing. The authors develop a systematic approach that trains models using only image‑level action labels, without requiring human bounding boxes. The approach attains comparable or superior accuracy to state‑of‑the‑art methods that use bounding boxes and can also segment precise human‑object interaction regions.
We focus on the problem of still image-based human action recognition, which essentially involves making predictions by analyzing human poses and their interactions with objects in the scene. Besides image-level action labels (e.g., riding, phoning), existing works usually require human bounding boxes as additional input during both the training and testing stages to facilitate the characterization of the underlying human-object interactions. We argue that this additional input requirement might severely discourage potential applications and is not strictly necessary. To this end, this paper develops a systematic approach to address this challenging problem of minimum annotation efforts, i.e., to perform recognition in the presence of only image-level action labels in the training stage. Experimental results on three benchmark data sets demonstrate that, compared with state-of-the-art methods that have privileged access to additional human bounding-box annotations, our approach achieves comparable or even superior recognition accuracy using only action annotations in training. Interestingly, as a by-product, in many cases our approach is able to segment out the precise regions of the underlying human-object interactions.