A Local–Global Interactive Vision Transformer for Aerial Scene Classification

Abstract

Generic image classification has been widely studied over the past decade. For bird's-eye-view aerial images, however, scene classification remains challenging because of the dramatic variation in scale and object size. Existing methods usually learn aerial scene representations with convolutional neural networks (CNNs), which focus on the local responses of an image. In contrast, the recently developed vision transformers (ViTs) can learn stronger global representations of aerial scenes, but they struggle to highlight the key objects in a scene under the same size and scale variation. To address this challenge, we propose a local-global interactive vision transformer (LG-ViT) for aerial scene classification. It is built on a deliberately designed local-global feature interactive learning scheme that jointly exploits local and global feature representations. To realize this scheme in an end-to-end manner, LG-ViT consists of three key components: local-global feature extraction, local-global feature interaction, and local-global semantic constraints. Extensive experiments on three aerial scene classification benchmarks, UCM, AID and NWPU, demonstrate the effectiveness of LG-ViT against state-of-the-art methods. The contribution of each component and the generalization capability of the model are also validated.
