Visual Selection and Multistage Reasoning for RSVG

Abstract

Remote sensing visual grounding (RSVG) is the task of locating the target indicated by a referring expression in a remote sensing (RS) image. Previous approaches directly concatenate visual and language features and stack a series of transformer encoders for cross-modal fusion. However, this fusion strategy fails to fully exploit the target attributes and contextual information contained in the referring expression, which limits the performance of existing methods. To address this issue, we propose a novel visual grounding framework for RSVG, named VSMR, which achieves accurate localization by adaptively selecting target-relevant features and performing multi-stage cross-modal reasoning. Specifically, we propose an Adaptive Feature Selection (AFS) module, which automatically selects the visual features relevant to the query while suppressing background noise. A Multi-Stage Decoder (MSD) then iteratively infers correlations between the image and the query by leveraging the object attributes and contextual information in the referring expression, thereby achieving accurate target localization. Experiments demonstrate that our method is superior to other state-of-the-art (SoTA) methods, achieving an accuracy of 78.24%.
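
The two ideas named in the abstract, query-conditioned feature selection and multi-stage reasoning, can be illustrated with a small sketch. The abstract does not specify the internals of AFS or MSD, so everything below is an assumption made for illustration: the function names (`select_features`, `multi_stage_decode`), the sigmoid gate, the mean-pooled query, and the averaging update are hypothetical stand-ins, not the authors' design.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens):
    """Average a list of equal-length vectors into one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def select_features(visual_tokens, query_tokens):
    """Soft selection (assumed form): scale each visual token by a sigmoid
    gate computed from its similarity to the pooled query embedding."""
    q = mean_pool(query_tokens)
    scale = math.sqrt(len(q))
    gates = [1.0 / (1.0 + math.exp(-dot(v, q) / scale)) for v in visual_tokens]
    selected = [[g * x for x in v] for g, v in zip(gates, visual_tokens)]
    return selected, gates


def attend(state, tokens):
    """Single-head scaled dot-product attention of one vector over tokens."""
    scale = math.sqrt(len(state))
    scores = [dot(state, t) / scale for t in tokens]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum((w / total) * t[i] for w, t in zip(weights, tokens))
        for i in range(len(state))
    ]


def multi_stage_decode(visual_tokens, query_tokens, num_stages=3):
    """Iterative refinement (assumed form): at each stage the target state
    re-reads the query and the selected visual tokens, then is updated."""
    state = mean_pool(query_tokens)
    for _ in range(num_stages):
        text_ctx = attend(state, query_tokens)
        vis_ctx = attend(state, visual_tokens)
        # Average the three vectors so the state stays bounded across stages.
        state = [(s + t + v) / 3.0 for s, t, v in zip(state, text_ctx, vis_ctx)]
    return state


# Toy inputs: two visual tokens, one aligned with the query and one opposed.
visual = [[1.0, 0.0], [-1.0, 0.0]]
query = [[2.0, 0.0], [2.0, 0.0]]
selected, gates = select_features(visual, query)
target_state = multi_stage_decode(selected, query)
```

In this toy case the token aligned with the query receives a gate above 0.5 and the opposed token a gate below 0.5, which is the suppression behaviour the abstract attributes to AFS. A real system would map the final state to box coordinates with a learned regression head, which is omitted here.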
