Publication | Open Access
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Citations: 197 | References: 35 | Year: 2020
Convolutional Neural Network, Scene Analysis, Machine Vision, Machine Learning, Data Science, Image Analysis, Pattern Recognition, Engineering, Scene Interpretation, Scene Understanding, Semantic Segmentation, Segmentation Transformer, Computer Science, Spatial Resolution, Deep Learning, Video Transformer, Computer Vision, Machine Translation
Semantic segmentation has largely relied on encoder-decoder FCNs that progressively reduce spatial resolution and enlarge receptive fields. Recent work adds dilated convolutions or attention modules to improve context modeling, but leaves the overall architecture unchanged. This study instead treats segmentation as a sequence-to-sequence prediction task, using a pure transformer to encode an image as a sequence of patches. The transformer encoder, which models global context in every layer, is paired with a simple decoder to form the SEgmentation TRansformer (SETR). SETR sets new state-of-the-art results on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and ranked first on the ADE20K test server leaderboard on the day of submission.
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, either through dilated/atrous convolutions or by inserting attention modules. However, the encoder-decoder-based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes. In particular, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
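The idea of encoding an image as a sequence of patches can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy image size, and the patch size are hypothetical choices made for illustration, and the transformer encoder and decoder themselves are left out. The sketch only shows how an image becomes a token sequence and how a sequence of per-patch outputs can be laid back out on a 2D grid, which is the shape a decoder would need to produce a segmentation map.

```python
def image_to_patch_sequence(image, patch_size):
    """Split an H x W x C image (nested lists) into a row-major
    sequence of flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    sequence = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    patch.extend(image[y][x])  # append the C channel values
            sequence.append(patch)
    return sequence


def sequence_to_grid(sequence, grid_height, grid_width):
    """Lay a row-major token sequence back out as a 2D grid of tokens."""
    if len(sequence) != grid_height * grid_width:
        raise ValueError("sequence length must equal grid_height * grid_width")
    return [
        sequence[row * grid_width:(row + 1) * grid_width]
        for row in range(grid_height)
    ]


# Toy example (all sizes are illustrative assumptions).
H, W, C, P = 4, 6, 3, 2
image = [[[y, x, 0] for x in range(W)] for y in range(H)]

tokens = image_to_patch_sequence(image, P)      # (H/P) * (W/P) tokens
grid = sequence_to_grid(tokens, H // P, W // P)  # back to a 2D layout
```

In a full model, each flattened patch would presumably be projected to an embedding and passed through the transformer before the reshaping step; here the raw patch values stand in for those encoded tokens.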