Learning semantic relationships for better action retrieval in images

TLDR

Human actions encompass diverse interactions between people and objects, yet the vast action space makes it hard to gather enough training examples, and individual actions are often composites of smaller actions and mutually exclusive with others. The study aims to offset supervision sparsity by exploiting semantic relationships among actions, developing a method that reasons about these relationships to infer unseen actions, and proposing a neural network framework that jointly extracts action relationships to train improved retrieval models. The framework integrates linguistic, visual, and logical consistency cues to identify action relationships, and is trained and evaluated on a large‑scale human action image dataset. The approach achieves a significant mean AP improvement over baselines, including the HEX‑graph method.

Abstract

Human actions capture a wide variety of interactions between people and objects. As a result, the set of possible actions is extremely large, and it is difficult to obtain sufficient training examples for all actions. However, we can compensate for this sparsity in supervision by leveraging the rich semantic relationships between different actions. A single action is often composed of other, smaller actions and is exclusive of certain others. We need a method that can reason about such relationships and extrapolate unobserved actions from known actions. Hence, we propose a novel neural network framework that jointly extracts the relationships between actions and uses them to train better action retrieval models. Our model incorporates linguistic, visual, and logical-consistency-based cues to identify these relationships effectively. We train and test our model on a large-scale image dataset of human actions. We show a significant improvement in mean AP compared to different baseline methods, including the HEX-graph approach from Deng et al. [8].
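To make the idea of extrapolating from known actions concrete, the following is a minimal sketch of how implication ("is composed of") and exclusion relationships could expand a single annotated action into additional positive and negative labels. It is not the paper's model, which learns the relationships jointly with the retrieval network. The action names, the `IMPLIES` and `EXCLUDES` tables, and the `expand_labels` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical, hand-written relationships (assumptions for illustration only;
# the paper learns such relationships rather than listing them by hand).
IMPLIES = {
    "riding a horse": {"sitting on a horse", "interacting with a horse"},
    "sitting on a horse": {"interacting with a horse"},
}
EXCLUDES = {
    ("riding a horse", "feeding a horse"),
}


def expand_labels(annotated_action):
    """Expand one annotated action into inferred positive and negative labels."""
    # Follow implication edges transitively to collect every implied action.
    positives = {annotated_action}
    frontier = [annotated_action]
    while frontier:
        action = frontier.pop()
        for implied in IMPLIES.get(action, ()):
            if implied not in positives:
                positives.add(implied)
                frontier.append(implied)

    # Any action that is exclusive with a positive action becomes a negative.
    negatives = set()
    for first, second in EXCLUDES:
        if first in positives:
            negatives.add(second)
        if second in positives:
            negatives.add(first)
    return positives, negatives


if __name__ == "__main__":
    pos, neg = expand_labels("riding a horse")
    print(sorted(pos))
    print(sorted(neg))
```

Under these assumed relationships, an image annotated only with "riding a horse" also supplies positive supervision for the two implied actions and negative supervision for "feeding a horse", which is the kind of extra signal that can offset sparse annotation.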
