Concepedia

TLDR

Understanding the visual world requires recognizing not only objects but also their interactions, especially involving humans, making human‑object interaction detection a key problem. The study aims to detect (human, verb, object) triplets in everyday photos by leveraging the hypothesis that a person’s appearance cues can localize interacting objects. We propose InteractNet, a human‑centric model that jointly learns to detect people and objects, predicts action‑specific density maps over target locations, and fuses predictions to infer interaction triplets end‑to‑end. On V‑COCO and HICO‑DET datasets, the method achieves quantitatively compelling results.

Abstract

To understand the visual world, a machine must not only recognize individual object instances but also how they interact. Humans are often at the center of such interactions and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting (human, verb, object) triplets in challenging everyday photos. We propose a novel model that is driven by a human-centric approach. Our hypothesis is that the appearance of a person - their pose, clothing, action - is a powerful cue for localizing the objects they are interacting with. To exploit this cue, our model learns to predict an action-specific density over target object locations based on the appearance of a detected person. Our model also jointly learns to detect people and objects, and by fusing these predictions it efficiently infers interaction triplets in a clean, jointly trained end-to-end system we call InteractNet. We validate our approach on the recently introduced Verbs in COCO (V-COCO) and HICO-DET datasets, where we show quantitatively compelling results.

References

YearCitations

Page 1