Publication | Open Access
UNETR: Transformers for 3D Medical Image Segmentation
Citations: 210
References: 35
Year: 2021
Structured Prediction, Geometric Learning, Medical Image Segmentation, Engineering, Machine Learning, Convolutional Neural Network, Autoencoders, Image Analysis, Data Science, Video Transformer, Radiology, Health Sciences, Machine Vision, Medical Imaging, Deep Learning, Medical Image Computing, Computer Vision, Biomedical Imaging, Convolutional Neural Networks, Unet Transformers, Medical Image Analysis, Image Segmentation, 3D Imaging
Fully convolutional neural networks with encoder–decoder architectures dominate medical image segmentation, yet the locality of their convolutional layers limits learning of long-range spatial dependencies. This study reformulates 3D medical image segmentation as a sequence-to-sequence prediction problem and introduces UNETR, an architecture that uses a transformer as the encoder to capture global multi-scale information while preserving the U-shaped encoder–decoder design. The transformer encoder is connected directly to the decoder through skip connections at multiple resolutions to produce the final semantic segmentation. UNETR is validated on the BTCV dataset (multi-organ segmentation) and the MSD dataset (brain tumor and spleen segmentation), and the benchmarks report new state-of-the-art performance on the BTCV leaderboard.
Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have been prominent in the majority of medical image segmentation applications over the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations that the decoder can use for semantic output prediction. Despite their success, the locality of convolutional layers in FCNNs limits their ability to learn long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed UNEt TRansformers (UNETR), that uses a transformer as the encoder to learn sequence representations of the input volume and effectively capture global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard.
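The abstract does not spell out how a volume becomes a sequence. As a rough illustration only, the sketch below assumes the common vision-transformer approach of cutting the input into non-overlapping cubic patches and flattening each one into a token. The function name `patchify_3d`, the toy 4 x 4 x 4 volume, and the patch size of 2 are hypothetical choices for this example and are not taken from the paper.

```python
from itertools import product


def patchify_3d(volume, patch):
    """Split a 3D volume (nested lists, H x W x D) into flattened cubic patches.

    Hypothetical helper: returns a list of tokens, each a flat list of
    patch**3 voxel values taken from one non-overlapping patch.
    """
    h, w, d = len(volume), len(volume[0]), len(volume[0][0])
    if h % patch or w % patch or d % patch:
        raise ValueError("each dimension must be divisible by the patch size")
    tokens = []
    # Walk the patch grid in raster order; every grid cell becomes one token.
    for i, j, k in product(range(0, h, patch), range(0, w, patch), range(0, d, patch)):
        token = [
            volume[i + di][j + dj][k + dk]
            for di, dj, dk in product(range(patch), repeat=3)
        ]
        tokens.append(token)
    return tokens


# Toy single-channel volume (an assumption); each voxel value encodes its position.
volume = [[[x * 16 + y * 4 + z for z in range(4)] for y in range(4)] for x in range(4)]
tokens = patchify_3d(volume, patch=2)
print(len(tokens), len(tokens[0]))  # sequence length, token dimension
```

In a transformer encoder of this kind, each flattened token would typically be linearly projected and combined with a positional embedding before self-attention. Those steps are omitted here because the abstract gives no details about them.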