Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?

TLDR

Remote sensing image captioning is an important yet understudied AI task that requires capturing multi‑scale ground elements, their attributes, and their interactions. This study asks whether a machine can produce human‑like language descriptions for remote sensing images. The authors propose a captioning framework that uses deep learning and fully convolutional networks to generate descriptions. Experiments on Google Earth and GaoFen‑2 high‑resolution images show that the method generates robust, comprehensive sentence descriptions at a desirable speed.

Abstract

This paper investigates an intriguing question in the remote sensing field: "Can a machine generate humanlike language descriptions for a remote sensing image?" The automatic description of a remote sensing image (namely, remote sensing image captioning) is an important but rarely studied task for artificial intelligence. The task is challenging because the description must not only capture ground elements at different scales, but also express their attributes and how these elements interact with each other. Despite these difficulties, we propose a remote sensing image captioning framework that leverages recent advances in deep learning and fully convolutional networks. Experimental results on a set of high-resolution optical images, including Google Earth images and GaoFen-2 satellite images, demonstrate that the proposed method is able to generate robust and comprehensive sentence descriptions with desirable speed performance.
