Publication | Closed Access
Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning
Citations: 470 | References: 63 | Year: 2022
Keywords: Convolutional Neural Network, Engineering, Machine Learning, Whole-slide Imaging, Digital Pathology, Biomedical Engineering, Vision Transformers, Image Classification, Image Analysis, Pattern Recognition, Self-supervised Learning, Single-image Super-resolution, Computational Imaging, Video Transformer, Radiology, Synthetic Image Generation, Machine Vision, Medical Imaging, Computational Pathology, Medical Image Computing, Deep Learning, Computer Vision, Hierarchical Structure, Image Representations, Biomedical Imaging, Scene Understanding, Medicine
Vision Transformers (ViTs) and their multi-scale and hierarchical variants have been successful at capturing image representations, but their use has generally been studied for low-resolution images (e.g., 256 × 256, 384 × 384). For gigapixel whole-slide imaging (WSI) in computational pathology, WSIs can be as large as 150,000 × 150,000 pixels at 20× magnification and exhibit a hierarchical structure of visual tokens across varying resolutions: from 16 × 16 images capturing individual cells, to 4096 × 4096 images characterizing interactions within the tissue microenvironment. We introduce a new ViT architecture called the Hierarchical Image Pyramid Transformer (HIPT), which leverages the natural hierarchical structure inherent in WSIs using two levels of self-supervised learning to learn high-resolution image representations. HIPT is pretrained across 33 cancer types using 10,678 gigapixel WSIs, 408,218 4096 × 4096 images, and 104M 256 × 256 images. We benchmark HIPT representations on 9 slide-level tasks, and demonstrate that: 1) HIPT with hierarchical pretraining outperforms current state-of-the-art methods for cancer subtyping and survival prediction, and 2) self-supervised ViTs are able to model important inductive biases about the hierarchical structure of phenotypes in the tumor microenvironment.
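The hierarchy the abstract describes (16 × 16 cell tokens inside 256 × 256 patches inside 4096 × 4096 regions) can be illustrated with a toy two-level aggregation. This is a minimal sketch, not the authors' implementation: the `cell_encoder` and `patch_encoder` arguments stand in for HIPT's pretrained ViT stages, and here they can be as simple as mean-pooling functions.

```python
import numpy as np

def extract_patches(image, patch):
    """Split a square image (H, W, C) into non-overlapping patch x patch tiles."""
    h, w, c = image.shape
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)

def hierarchical_embed(region, cell_encoder, patch_encoder):
    """Two-level aggregation sketch: region -> 256x256 patches -> 16x16 cell tokens.

    cell_encoder and patch_encoder are hypothetical stand-ins for the
    pretrained ViT stages; any callable mapping tiles/token stacks to
    vectors works for illustration.
    """
    patches = extract_patches(region, 256)            # 256x256 patches of the region
    patch_embs = []
    for p in patches:
        cells = extract_patches(p, 16)                # 16x16 cell-level tokens per patch
        cell_embs = np.stack([cell_encoder(c) for c in cells])
        patch_embs.append(patch_encoder(cell_embs))   # aggregate cell tokens into one patch embedding
    return np.stack(patch_embs)                       # sequence of patch embeddings for the region
```

A third, slide-level stage would then aggregate these region tokens; with mean-pooling stand-ins, a 512 × 512 toy "region" yields a (4, C) array of patch embeddings.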