A Joint-Training Two-Stage Method For Remote Sensing Image Captioning

Abstract

Compared with remote sensing image (RSI) captioning methods based on the traditional encoder-decoder model, two-stage RSI captioning methods include an auxiliary remote sensing task that provides prior information, which enables them to generate more accurate descriptions. In previous two-stage RSI captioning methods, however, the image captioning task and the auxiliary remote sensing task are handled separately, which is time-consuming and ignores mutual interference between the tasks. To solve this problem, we propose a novel joint-training two-stage (JTTS) RSI captioning method. We use multi-label classification to provide prior information, and we design a differentiable sampling operator that replaces the traditional non-differentiable sampling operation used to index the multi-label classification result. In contrast to previous two-stage RSI captioning methods, our method supports joint training, and the joint loss allows the error of the generated description to flow into the optimization of the multi-label classification via back-propagation. Specifically, we approximate the Heaviside step function with a steep logistic function to implement a differentiable sampling operator for the multi-label classification. We propose a dynamic contrast loss function for the multi-label classification task to ensure that a certain margin is maintained between the probabilities of the positive labels and the negative labels during sampling. We design an attribute-guided decoder that filters the multi-label prior information obtained by the sampling operator to generate more accurate image captions. The results of extensive experiments show that the JTTS method achieves state-of-the-art performance on the RSICD, UCM-Captions, and Sydney-Captions datasets.
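
The idea of replacing the Heaviside step function with a steep logistic function can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold of 0.5, and the steepness value of 50 are assumptions chosen for illustration, and the abstract does not state the values actually used. The sketch contrasts hard sampling, whose gradient is zero almost everywhere, with a soft version whose gradient is non-zero, so an error signal from the captioning loss could in principle reach the classifier.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_sample(probs, threshold=0.5):
    """Heaviside-style sampling: 1 if the label probability exceeds the
    threshold, else 0. Its derivative is zero almost everywhere, so no
    gradient can flow back to the classifier through this step."""
    return [1.0 if p > threshold else 0.0 for p in probs]


def soft_sample(probs, threshold=0.5, steepness=50.0):
    """Steep logistic approximation of the step (assumed form):
    sigma(k * (p - t)), with k = steepness and t = threshold."""
    return [_sigmoid(steepness * (p - threshold)) for p in probs]


def soft_sample_grad(probs, threshold=0.5, steepness=50.0):
    """Derivative of soft_sample with respect to each probability:
    k * s * (1 - s), where s is the soft sample value."""
    samples = soft_sample(probs, threshold, steepness)
    return [steepness * s * (1.0 - s) for s in samples]


# Hypothetical multi-label probabilities for four attributes.
probs = [0.05, 0.45, 0.55, 0.95]
hard = hard_sample(probs)
soft = soft_sample(probs)
grads = soft_sample_grad(probs)
```

With a large steepness the soft values stay close to the hard 0/1 decisions, while the gradient is largest for probabilities near the threshold. That concentration of gradient near the threshold is one plausible reason to keep a margin between positive and negative label probabilities during sampling.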
