TrTr-CMR: Cross-Modal Reasoning Dual Transformer for Remote Sensing Image Captioning

Abstract

Remote sensing image captioning (RSIC) is an interesting but challenging cross-modal reasoning task that spans computer vision and natural language processing. Most recent popular approaches to RSIC use encoder-decoder architectures in which a convolutional neural network (CNN)-based encoder captures visual features and a recurrent neural network (RNN)-based or long short-term memory (LSTM)-based decoder captures semantic information, but these approaches struggle with the multiscale objects, multiple categories, and direction ambiguity found in remote sensing images. To make the most of the semantic understanding ability of Transformers, in this article we propose a new attention-based visual-linguistic reasoning framework with a dual Transformer for RSIC. Specifically, a Swin Transformer (SwinT) encoder with a shifted window partitioning scheme is introduced for multiscale visual feature extraction to discover the intrinsic relationships among objects, and a Transformer language model (TLM) with self-attention and cross-attention is then designed as the decoder to generate a well-formed sentence for the image. Extensive experiments are conducted on the public RSIC benchmark datasets UCM-Captions, Sydney-Captions, and RSICD. The resulting performance verifies the effectiveness and superiority of the proposed method. In addition, the source code and models of this work are publicly available at https://github.com/LianYi233/TrTr-CMR.
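
The shifted window partitioning scheme named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 4 x 4 grid of patch indices, and the window size of 2 are assumptions chosen only for illustration, and the attention masking that Swin Transformer applies to wrapped-around windows is omitted. The sketch shows only how a cyclic shift by half a window makes the new windows straddle the borders of the regular ones, so that successive layers can exchange information across window boundaries.

```python
def window_partition(grid, window):
    """Split an H x W grid (a list of rows) into non-overlapping
    window x window blocks, scanned left to right, top to bottom."""
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be a multiple of the window size")
    blocks = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            blocks.append([row[left:left + window]
                           for row in grid[top:top + window]])
    return blocks


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` cells, wrapping around."""
    rolled_rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled_rows]


def shifted_window_partition(grid, window):
    """Partition after a cyclic shift of half a window (illustrative only)."""
    return window_partition(cyclic_shift(grid, window // 2), window)


# Hypothetical 4 x 4 grid whose cells hold patch indices 0..15.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(grid, 2)
shifted = shifted_window_partition(grid, 2)

print(regular[0])  # patches 0, 1, 4, 5: the top-left regular window
print(shifted[0])  # patches 5, 6, 9, 10: one patch from each regular window
```

In this toy example the first shifted window contains one patch from each of the four regular windows, which is the cross-window connection the shifting is meant to provide.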
