Objects in Action: An Approach for Combining Action Understanding and Object Perception

TLDR

Analyzing videos of human-object interactions requires understanding human movements, locating and recognizing objects, and observing the effects of those movements on the objects. Recognition improves when these elements are considered together rather than independently; traditional approaches, by contrast, rely on shape features for object classification and on motion analysis for action understanding. The study proposes a Bayesian framework that unifies inference for object classification and localization, action understanding, and perception of object reaction. Embedding object classification and localization within a video-interpretation framework lets the approach localize and classify objects that are hard to localize in clutter or hard to recognize for lack of discriminative features. Applying context from the objects involved and from the effects of the movements lets it segment and recognize actions that are too subtle for motion features alone.

Abstract

Analysis of videos of human-object interactions involves understanding human movements, locating and recognizing objects and observing the effects of human movements on those objects. While each of these can be conducted independently, recognition improves when interactions between these elements are considered. Motivated by psychological studies of human perception, we present a Bayesian approach which unifies the inference processes involved in object classification and localization, action understanding and perception of object reaction. Traditional approaches for object classification and action understanding have relied on shape features and movement analysis respectively. By placing object classification and localization in a video interpretation framework, we can localize and classify objects which are either hard to localize due to clutter or hard to recognize due to lack of discriminative features. Similarly, by applying context on human movements from the objects on which these movements impinge and the effects of these movements, we can segment and recognize actions which are either too subtle to perceive or too hard to recognize using motion features alone.
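
To illustrate why joint inference can outperform independent inference, the following is a minimal toy sketch, not the model from the paper. All labels, probabilities, and the `compatibility` table are hypothetical values chosen for illustration; the actual framework also covers localization and object reaction, which are omitted here.

```python
from itertools import product

# Hypothetical per-cue likelihoods (illustrative numbers, not from the paper).
# Appearance alone slightly prefers "spray_bottle"; motion alone prefers "drinking".
object_likelihood = {"cup": 0.45, "spray_bottle": 0.55}
action_likelihood = {"drinking": 0.60, "spraying": 0.40}

# Hypothetical contextual term: how plausible each action is for each object.
compatibility = {
    ("cup", "drinking"): 0.90,
    ("cup", "spraying"): 0.05,
    ("spray_bottle", "drinking"): 0.05,
    ("spray_bottle", "spraying"): 0.90,
}


def independent_map():
    """Pick the best object and the best action separately, ignoring context."""
    best_object = max(object_likelihood, key=object_likelihood.get)
    best_action = max(action_likelihood, key=action_likelihood.get)
    return best_object, best_action


def joint_map():
    """Pick the (object, action) pair that maximizes the joint score."""
    def score(pair):
        obj, act = pair
        return object_likelihood[obj] * action_likelihood[act] * compatibility[pair]

    return max(product(object_likelihood, action_likelihood), key=score)


if __name__ == "__main__":
    print("independent:", independent_map())
    print("joint:", joint_map())
```

In this toy setting the independent estimate pairs an object with an action that rarely go together, while the joint estimate lets the stronger motion cue revise the weaker appearance cue.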
