Scaling Vision Transformers

Abstract

Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results; understanding a model's scaling properties is therefore key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up and down, and characterize the relationships between error rate, data, and compute. Along the way, we refine the architecture and training of ViT, reducing memory consumption and increasing the accuracy of the resulting models. As a result, we successfully train a ViT model with two billion parameters, which attains a new state of the art on ImageNet of 90.45% top-1 accuracy. The model also performs well for few-shot transfer, for example, reaching 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
