Publication | Closed Access
Learning to Detect Human-Object Interactions With Knowledge
Citations: 155
References: 39
Year: 2019
Unknown Venue
Artificial Intelligence, Engineering, Machine Learning, Video Interpretation, Human-object Interaction, Natural Language Processing, Multimodal LLM, Image Analysis, Visual Scenes, Data Science, Text-to-image Retrieval, Pattern Recognition, Human-object Interactions, Visual Grounding, Visual Question Answering, Robot Learning, Health Sciences, Machine Vision, Instance-level Detection Tasks, Vision Language Model, HOI Detection, Computer Science, Video Understanding, Deep Learning, Computer Vision, Activity Recognition
Recent advances in instance-level detection provide a foundation for automated visual scene understanding, yet fully comprehending social scenes remains elusive. Human-object interaction (HOI) detection seeks to localize humans and objects and to identify the complex interactions between them. This study detects HOIs in images and addresses the long-tail challenge by modeling semantic regularities among verbs and objects. The authors build a knowledge graph from training annotations and external sources, then employ multi-modal learning to retrieve dynamic, image-specific knowledge and enhance the semantic embedding space for HOI comprehension. The resulting method outperforms baselines on V-COCO and HICO-DET, with the largest gains on rare HOI categories.
Recent advances in instance-level detection tasks lay a strong foundation for automated visual scene understanding. However, the ability to fully comprehend a social scene still eludes us. In this work, we focus on detecting human-object interactions (HOIs) in images, an essential step towards deeper scene understanding. HOI detection aims to localize humans and objects, as well as to identify the complex interactions between them. Innat... Given the key observation that HOIs contain intrinsic semantic regularities despite being visually diverse, we tackle the challenge of long-tail HOI categories by modeling the underlying regularities among verbs and objects in HOIs as well as general relationships. In particular, we construct a knowledge graph based on the ground-truth annotations of the training dataset and an external source. In contrast to direct knowledge incorporation, we address the need for dynamic, image-specific knowledge retrieval through multi-modal learning, which leads to an enhanced semantic embedding space for HOI comprehension. The proposed method shows improved performance on the V-COCO and HICO-DET benchmarks, especially when predicting rare HOI categories.
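The abstract does not specify how the knowledge graph is built, so the following is only a minimal sketch of one plausible reading: treating ground-truth (verb, object) annotations as weighted edges and normalizing them into a per-object verb prior. The function names `build_verb_object_graph` and `verb_prior`, and the toy annotations, are hypothetical and are not taken from the paper; the external knowledge source and the multi-modal retrieval step are not modeled here.

```python
from collections import Counter, defaultdict


def build_verb_object_graph(annotations):
    """Build a weighted object -> verb graph from (verb, object) pairs.

    Assumption: each annotation is a (verb, object) tuple; the edge weight
    is simply how often that pair occurs in the training annotations.
    """
    edge_counts = Counter(annotations)
    graph = defaultdict(dict)
    for (verb, obj), count in edge_counts.items():
        graph[obj][verb] = count
    return graph


def verb_prior(graph, obj):
    """Normalize the edge weights of one object into P(verb | object)."""
    counts = graph.get(obj, {})
    total = sum(counts.values())
    if total == 0:
        return {}  # unseen object: no prior available
    return {verb: count / total for verb, count in counts.items()}


# Hypothetical toy annotations, for illustration only.
annotations = [
    ("ride", "bicycle"), ("ride", "bicycle"), ("hold", "bicycle"),
    ("ride", "horse"), ("feed", "horse"),
    ("hold", "cup"),
]

graph = build_verb_object_graph(annotations)
prior = verb_prior(graph, "bicycle")
```

A prior of this kind captures only the co-occurrence regularities among verbs and objects; it is static, whereas the abstract stresses retrieving knowledge dynamically per image.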