DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation

TLDR

Automatic medical image segmentation has advanced with deep learning, yet existing Transformer‑based models often ignore pixel‑level structure within patches. This work introduces DS‑TransUNet, a dual Swin Transformer U‑Net designed to embed hierarchical Swin Transformers in both encoder and decoder to improve segmentation. DS‑TransUNet employs dual‑scale Swin Transformer encoders to capture coarse and fine features, a Transformer Interactive Fusion module for multi‑scale fusion, and a Swin Transformer block in the decoder to exploit long‑range context during up‑sampling. Experiments on four medical segmentation tasks show DS‑TransUNet significantly outperforms state‑of‑the‑art methods.

Abstract

Automatic medical image segmentation has made great progress owing to powerful deep representation learning. Inspired by the success of the self-attention mechanism in Transformers, considerable effort has been devoted to designing robust Transformer-based variants of the encoder-decoder architecture. However, the patch division used in existing Transformer-based models usually ignores the pixel-level intrinsic structural features inside each patch. In this paper, we propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet), which incorporates the hierarchical Swin Transformer into both the encoder and the decoder of the standard U-shaped architecture. DS-TransUNet benefits from the self-attention computation in the Swin Transformer and from the designed dual-scale encoding, which together model non-local dependencies and multi-scale contexts to improve the semantic segmentation quality of varied medical images. Unlike many prior Transformer-based solutions, the proposed DS-TransUNet adopts a well-established dual-scale encoding mechanism that uses dual-scale encoders based on the Swin Transformer to extract coarse and fine-grained feature representations at different semantic scales. Meanwhile, a Transformer Interactive Fusion (TIF) module is proposed to perform multi-scale information fusion through the self-attention mechanism. Furthermore, we introduce the Swin Transformer block into the decoder to further explore long-range contextual information during the up-sampling process. Extensive experiments across four typical medical image segmentation tasks demonstrate the effectiveness of DS-TransUNet, and our approach significantly outperforms state-of-the-art methods.
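The dual-scale encoding and self-attention fusion described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the patch sizes (4 and 8), the chunk-averaging `embed` stand-in for a learned patch embedding, the identity query/key/value projections, and the choice to summarize the coarse branch as a single averaged token are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

def patch_tokens(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch
    tiles and flatten each tile into one token vector."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

def embed(token, dim):
    """Assumed stand-in for a learned linear embedding: average the flattened
    patch in `dim` equal chunks so both scales share one feature size."""
    chunk = len(token) // dim
    return [sum(token[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity
    query/key/value projections (no learned weights)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(wt / total * v[i] for wt, v in zip(weights, tokens))
                    for i in range(dim)])
    return out

def fuse(fine, coarse):
    """Assumed fusion step: average the coarse tokens into one summary token,
    prepend it to the fine tokens, run self-attention over the sequence, and
    keep the updated fine tokens."""
    dim = len(coarse[0])
    summary = [sum(t[i] for t in coarse) / len(coarse) for i in range(dim)]
    return self_attention([summary] + fine)[1:]

# Toy 16 x 16 "image" with intensities scaled to [0, 1].
image = [[(r * 16 + c) / 255 for c in range(16)] for r in range(16)]
fine = [embed(t, 4) for t in patch_tokens(image, 4)]    # fine-scale branch
coarse = [embed(t, 4) for t in patch_tokens(image, 8)]  # coarse-scale branch
fused = fuse(fine, coarse)
print(len(fine), len(coarse), len(fused))
```

The smaller patch size yields more tokens, each covering less of the image, while the larger patch size yields fewer tokens with wider spatial coverage; the fusion step lets every fine-scale token attend to a summary of the coarse-scale branch.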
