MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image Segmentation

Abstract

State Space Models (SSMs), particularly Mamba, have demonstrated significant potential in medical image segmentation due to their ability to model long-range dependencies with linear computational complexity. However, accurate medical image segmentation requires effectively learning both multi-scale detailed feature representations and global contextual dependencies. Although existing studies have attempted to address this challenge by integrating CNNs and SSMs to leverage their respective strengths, they have neither developed specialized modules to effectively capture multi-scale feature representations nor sufficiently addressed the directional sensitivity issue that arises when applying Mamba to 2D image data. To overcome these limitations, we propose a Multi-Scale Vision Mamba UNet model for medical image segmentation, termed MSVM-UNet. Specifically, by introducing multi-scale convolutions in the VSS blocks, we can more effectively capture and aggregate multi-scale feature representations from the hierarchical features of the VMamba encoder and better handle 2D visual data. Additionally, the large kernel patch expanding (LKPE) layers achieve more efficient upsampling of feature maps by simultaneously integrating spatial and channel information. Extensive experiments on the Synapse and ACDC datasets demonstrate that our approach is more effective than some state-of-the-art methods in capturing and aggregating multi-scale feature representations and modeling long-range dependencies between pixels. Our implementation is available at https://github.com/gndlwch2w/msvm-unet.
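The abstract does not describe the internals of the LKPE layer. As a rough illustration only, the generic patch-expanding idea of trading channels for spatial resolution can be sketched in plain Python. The function name `depth_to_space`, the channel ordering, and the toy input are assumptions made for this sketch and are not taken from the MSVM-UNet implementation; the learned, large-kernel part implied by the layer's name is not modeled here.

```python
def depth_to_space(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Hypothetical helper: each group of r*r input channels supplies the
    r x r block of output pixels for one spatial location.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h):
            for j in range(w):
                for di in range(r):
                    for dj in range(r):
                        # Assumed channel ordering: (c, di, dj), row-major.
                        src = c * r * r + di * r + dj
                        out[c][i * r + di][j * r + dj] = x[src][i][j]
    return out


# Toy input: 4 channels of a 1x2 map become 1 channel of a 2x4 map.
x = [[[float(10 * k + j) for j in range(2)]] for k in range(4)]
y = depth_to_space(x, 2)
```

In this sketch the upsampled map has no new information; it only shows how channel content can be redistributed into space. A real patch-expanding layer would typically precede this rearrangement with a learned projection.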
