Transformer-based Dual Relation Graph for Multi-label Image Recognition

Abstract

The simultaneous recognition of multiple objects in one image remains a challenging task, as it must cope with several difficulties at once, such as varying object scales, inconsistent appearances, and confusing inter-class relationships. Recent research mainly resorts to statistical label co-occurrences and linguistic word embeddings to enhance the unclear semantics. Unlike these approaches, in this paper we propose a novel Transformer-based Dual Relation learning framework that constructs complementary relationships by exploring two aspects of correlation, i.e., a structural relation graph and a semantic relation graph. The structural relation graph aims to capture long-range correlations from object context by developing a cross-scale transformer-based architecture. The semantic relation graph dynamically models the semantic meanings of image objects with explicit semantic-aware constraints. In addition, we incorporate the learnt structural relationship into the semantic graph, constructing a joint relation graph for robust representations. With the collaborative learning of these two effective relation graphs, our approach achieves new state-of-the-art results on two popular multi-label recognition benchmarks, i.e., the MS-COCO and VOC 2007 datasets.
