Publication | Closed Access
Detecting and Recognizing Human-Object Interactions
Citations: 593
References: 31
Year: 2018
Unknown Venue
Engineering, Machine Learning, Activity Recognition, Human-object Interaction, Image Analysis, Pattern Recognition, Human-object Interactions, Robot Learning, Multimodal Human Computer Interface, Health Sciences, Machine Vision, Object Detection, Vision Language Model, Computer Science, Video Understanding, Deep Learning, Computer Vision, Interaction Triplets, Individual Object Instances, Scene Understanding, Visual World, Scene Modeling
Understanding the visual world requires recognizing not only objects but also their interactions, especially those involving humans, which makes human-object interaction detection a key problem. The study aims to detect (human, verb, object) triplets in everyday photos, building on the hypothesis that a person's appearance cues can localize the objects they interact with. The authors propose InteractNet, a human-centric model that jointly learns to detect people and objects, predicts an action-specific density over target object locations, and fuses these predictions to infer interaction triplets end-to-end. On the V-COCO and HICO-DET datasets, the method achieves quantitatively compelling results.
To understand the visual world, a machine must recognize not only individual object instances but also how they interact. Humans are often at the center of such interactions, and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting (human, verb, object) triplets in challenging everyday photos. We propose a novel human-centric model. Our hypothesis is that the appearance of a person (their pose, clothing, and action) is a powerful cue for localizing the objects they are interacting with. To exploit this cue, our model learns to predict an action-specific density over target object locations based on the appearance of a detected person. Our model also jointly learns to detect people and objects, and by fusing these predictions it efficiently infers interaction triplets in a clean, jointly trained end-to-end system we call InteractNet. We validate our approach on the recently introduced Verbs in COCO (V-COCO) and HICO-DET datasets, where we show quantitatively compelling results.
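The fusion step described in the abstract can be illustrated with a small sketch. The abstract does not give the form of the density or the fusion rule, so everything specific here is an assumption made for illustration: the density is modeled as an unnormalized Gaussian around a predicted target location, the fused triplet score is a simple product of the individual scores, and all names (`Detection`, `target_density`, `score_triplets`, `sigma`) are hypothetical rather than taken from InteractNet.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels
    score: float  # detector confidence in [0, 1]


def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def target_density(predicted_center, object_box, sigma):
    """Assumed unnormalized Gaussian: 1.0 when the object's center sits
    exactly at the location predicted from the person, decaying with distance."""
    cx, cy = box_center(object_box)
    px, py = predicted_center
    sq_dist = (cx - px) ** 2 + (cy - py) ** 2
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def score_triplets(human, verb, verb_score, predicted_center, objects, sigma=50.0):
    """Rank candidate (human, verb, object) triplets.

    Assumed fusion rule: product of the human score, the object score,
    the verb score, and the action-specific density at the object."""
    scored = []
    for obj in objects:
        g = target_density(predicted_center, obj.box, sigma)
        scored.append((human.score * obj.score * verb_score * g, verb, obj.label))
    return sorted(scored, key=lambda t: t[0], reverse=True)


# Toy inputs (made up): a person, and a target location for "hold"
# that a model would predict from the person's appearance.
person = Detection("person", (100, 100, 200, 300), 0.9)
candidates = [
    Detection("cup", (200, 160, 240, 200), 0.8),
    Detection("ball", (400, 400, 440, 440), 0.8),
]
ranked = score_triplets(person, "hold", 0.7, (220.0, 180.0), candidates)
best_score, best_verb, best_object = ranked[0]
```

In this toy setup both candidate objects have the same detector score, so the ranking is decided entirely by how close each object lies to the location predicted from the person, which is the role the abstract assigns to the action-specific density.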