Concepedia

Publication | Closed Access

Word–Sentence Framework for Remote Sensing Image Captioning

Citations: 114

References: 51

Year: 2020

TLDR

Remote sensing image captioning generates descriptive sentences for satellite images, typically using encoder–decoder models that lack explainability. This study introduces an explainable word–sentence framework to address that limitation. The framework separates captioning into word extraction and sentence generation, treating it as word classification and sorting, and is evaluated on Sydney‑captions, UCM‑captions, and RSICD datasets. Experiments demonstrate that the proposed method achieves performance comparable to state‑of‑the‑art encoder–decoder approaches.

Abstract

Remote sensing image captioning (RSIC), which aims to generate a well-formed sentence for a remote sensing image, has attracted increasing attention in recent years. The general framework for RSIC is the encoder–decoder architecture, which consists of two submodels: an encoder and a decoder. Although it achieves strong performance, the encoder–decoder architecture is a black-box model that lacks explainability. To overcome this drawback, we propose a new explainable word–sentence framework for RSIC. The framework consists of two parts, a word extractor and a sentence generator: the former extracts the valuable words from the given remote sensing image, and the latter organizes these words into a well-formed sentence. The framework thus decomposes RSIC into a word classification task and a word sorting task, which is more in line with human intuition. Based on this framework, ablation experiments are conducted on three public RSIC data sets (Sydney-captions, UCM-captions, and RSICD) to explore specific and effective network structures. To evaluate the framework objectively, we further conduct comparative experiments on these three data sets and achieve results comparable to those of encoder–decoder-based methods.
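The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold, the toy probabilities, and the use of a per-word position score for sorting are all assumptions made for illustration, since the abstract does not specify the network structures behind either stage.

```python
from typing import Dict, List


def extract_words(word_probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Word classification stage (hypothetical): keep each vocabulary word
    whose predicted probability of appearing in the caption meets the threshold."""
    return [word for word, prob in word_probs.items() if prob >= threshold]


def sort_words(words: List[str], position_scores: Dict[str, float]) -> List[str]:
    """Word sorting stage (hypothetical): order the kept words by an assumed
    position score, where a lower score means earlier in the sentence."""
    return sorted(words, key=lambda word: position_scores[word])


def caption(word_probs: Dict[str, float],
            position_scores: Dict[str, float],
            threshold: float = 0.5) -> str:
    """Chain the two stages: extract the words, then arrange them into a sentence."""
    words = extract_words(word_probs, threshold)
    return " ".join(sort_words(words, position_scores))


# Toy values standing in for model outputs; they are invented for this example.
word_probs = {
    "airport": 0.91, "river": 0.08, "planes": 0.84,
    "many": 0.77, "an": 0.66, "at": 0.72, "forest": 0.12,
}
position_scores = {
    "many": 0, "planes": 1, "at": 2, "an": 3,
    "airport": 4, "river": 5, "forest": 6,
}

print(caption(word_probs, position_scores))  # prints "many planes at an airport"
```

In this sketch the intermediate word list is directly inspectable, which is the kind of explainability the framework aims for: one can see which words were selected before they are arranged into a sentence.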
