Abstract

Although vision transformer-based methods (ViTs) outperform convolutional neural networks (CNNs) on image recognition tasks, their pixel-level semantic segmentation ability is limited because they do not explicitly exploit local biases. A variety of hybrid ViT-CNN structures have recently been proposed, but these methods fuse multi-scale features poorly and cannot accurately segment high-resolution, content-rich, complex land cover remote sensing images. This work therefore proposes a dual encoder-decoder network named DEDNet. In the encoding stage, a CNN encoder and a Transformer encoder run in parallel to extract the local and global information of the image, respectively. In the decoding stage, a cross-stage fusion (CF) module provides neighborhood attention guidance that improves the localization of small targets and effectively avoids intra-class inconsistency. In parallel, a multi-head feature extraction (MFE) module strengthens the recognition of target boundaries and effectively avoids inter-class ambiguity. Before the output, a fusion spatial pyramid pooling (FSPP) classifier merges the outputs of the two decoding strategies. Experiments demonstrate that the proposed model generalizes well and can handle a variety of land cover semantic segmentation tasks.
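The parallel-encoder idea can be illustrated with a small NumPy sketch. It is not the DEDNet implementation: the abstract gives no layer details, so the local branch here is a plain 3x3 mean filter standing in for a CNN encoder, the global branch is single-head self-attention with identity projections standing in for a Transformer encoder, and the two outputs are simply concatenated. The function names `local_branch`, `global_branch`, and `dual_encode` are hypothetical and chosen only for illustration.

```python
import numpy as np


def local_branch(x):
    """CNN-like stand-in (assumption): 3x3 mean filter, local context only."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(x, dtype=float)
    # Sum the nine shifted views of the padded image, then average.
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :]
    return out / 9.0


def global_branch(x):
    """Transformer-like stand-in (assumption): self-attention over all pixels."""
    h, w, c = x.shape
    tokens = x.reshape(h * w, c).astype(float)
    # Scaled dot-product scores between every pair of pixels.
    scores = tokens @ tokens.T / np.sqrt(c)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return (weights @ tokens).reshape(h, w, c)


def dual_encode(x):
    """Run both branches on the same input and concatenate along channels."""
    return np.concatenate([local_branch(x), global_branch(x)], axis=-1)


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))   # toy H x W x C input
features = dual_encode(image)   # H x W x 2C feature map
```

Each output pixel of the local branch depends only on its 3x3 neighborhood, whereas each output pixel of the global branch is a weighted average over every pixel in the image. That contrast is what the two encoders are meant to capture; a real model would use learned convolutions, learned attention projections, and the CF, MFE, and FSPP modules in place of the plain concatenation used here.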
