CMLFormer: CNN and Multiscale Local-Context Transformer Network for Remote Sensing Images Semantic Segmentation

Abstract

Remote sensing images are characterized by complex ground objects, rich feature details, large intra-class variance and small inter-class variance, so deep learning semantic segmentation methods need strong feature representation ability to handle them. Because convolution operates on a limited receptive field, Convolutional Neural Networks (CNNs) are good at capturing local details but perform poorly at modelling long-range dependencies. Transformers rely on multi-head self-attention mechanisms to extract global contextual information, but this usually leads to high computational complexity. Therefore, this paper proposes the CNN and Multi-scale Local-context Transformer network (CMLFormer), a novel encoder-decoder structured network for remote sensing image semantic segmentation. Specifically, for the features extracted by the lightweight ResNet18 encoder, we design a transformer decoder based on the Multi-scale Local-context Transformer Block (MLTB) to enhance feature learning. By using a self-attention mechanism with non-overlapping windows, together with multi-scale horizontal and vertical interactive stripe convolution, MLTB is able to capture both local and global feature information at different scales with low complexity. Additionally, the Feature Enhanced Module (FEM) is introduced into the encoder to further facilitate the learning of global and local information. Experimental results show that our proposed CMLFormer exhibits excellent performance on the Vaihingen and Potsdam datasets. The code is available at https://github.com/DrWuHonglin/CMLFormer.
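
The low-complexity claim for non-overlapping window self-attention can be illustrated with a small counting sketch. It is not taken from the CMLFormer code: the function names, the 64 x 64 feature map and the window size of 8 are hypothetical values chosen for illustration, and the count covers only query-key pairs (it ignores channel width, attention heads and the stripe convolution branch).

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def window_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when attention is restricted to non-overlapping
    window x window tiles of the feature map."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


if __name__ == "__main__":
    h = w = 64  # hypothetical feature-map size (assumption)
    win = 8     # hypothetical window size (assumption)
    full = global_attention_pairs(h, w)
    local = window_attention_pairs(h, w, win)
    print(f"global: {full}, windowed: {local}, ratio: {full // local}")
```

Under these assumptions the windowed variant evaluates 64 times fewer query-key pairs than global attention. The cost grows linearly with the number of tokens for a fixed window size, rather than quadratically, which is the general reason window-based attention is cheaper.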
