SRViT: Self-Supervised Relation-Aware Vision Transformer for Hyperspectral Unmixing

Abstract

The vision transformer (ViT) has recently become a popular topic in the foundation model field owing to its strong scalability and outstanding representation capabilities. As a deep model, ViT offers a new architecture for hyperspectral image (HSI) unmixing. However, traditional ViTs partition the input image into nonoverlapping fixed-size patches and thereby overlook pixel-level spatial continuity. This partitioning disrupts local structural relationships and hinders the model's ability to capture fine-grained spatial dependencies, resulting in suboptimal feature representations for unmixing, which is a dense prediction task. To address these challenges, this article proposes a self-supervised relation-aware ViT (SRViT). SRViT incorporates a self-embedded module comprising encoders, a pixel-level position encoder (PLPE), a self-supervised contrastive mechanism (SCM), and a decoder. The self-embedded module and PLPE preserve local correlations in the HSI across different views, which facilitates cross-view learning through SCM to improve generalization. In addition, the decoder incorporates Kronecker-factored approximate curvature (K-FAC) to capture the local geometric structure of spectral information. Ultimately, SRViT learns the endmembers and their fractional abundances as the unmixing result. Comparative experiments systematically validate the effectiveness and competitiveness of SRViT. The source code is available at https://github.com/yuanchaosu/TNNLS-SRViT.
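
The effect of nonoverlapping patch partitioning on pixel-level continuity can be illustrated with a small sketch. It is not taken from the SRViT code: the function names `patch_index` and `split_neighbors`, the 8 x 8 image size, and the patch size of 4 are all hypothetical, and the image sides are assumed to be divisible by the patch size. The sketch only counts how many directly adjacent pixel pairs end up in different patches, that is, in different tokens of a standard ViT.

```python
def patch_index(row, col, patch_size, image_width):
    """Return the index of the nonoverlapping patch containing pixel (row, col).

    Assumes image_width is divisible by patch_size (illustrative assumption).
    """
    patches_per_row = image_width // patch_size
    return (row // patch_size) * patches_per_row + (col // patch_size)


def split_neighbors(height, width, patch_size):
    """Count adjacent pixel pairs (right and bottom neighbors) that fall
    into different patches, together with the total number of such pairs."""
    split = 0
    total = 0
    for r in range(height):
        for c in range(width):
            here = patch_index(r, c, patch_size, width)
            if c + 1 < width:  # right neighbor exists
                total += 1
                split += here != patch_index(r, c + 1, patch_size, width)
            if r + 1 < height:  # bottom neighbor exists
                total += 1
                split += here != patch_index(r + 1, c, patch_size, width)
    return split, total


# Hypothetical 8 x 8 image cut into 4 x 4 patches.
split, total = split_neighbors(height=8, width=8, patch_size=4)
print(f"{split} of {total} neighboring pixel pairs are separated by a patch boundary")
```

Pixels on either side of a patch boundary are spatial neighbors in the image but belong to different tokens, which is the kind of disrupted local relationship the abstract refers to.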
