LiT: Zero-Shot Transfer with Locked-image text Tuning

Abstract

This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning “Locked-image Tuning” (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1% on the challenging out-of-distribution ObjectNet test set.
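To make the idea concrete, the following is a minimal sketch of contrastive alignment and zero-shot classification, not the authors' implementation. It assumes a symmetric softmax contrastive loss over cosine similarities, which is a common choice for image-text training but is not specified in the abstract. The function names, the temperature value of 0.1, and the toy two-dimensional embeddings are all hypothetical. In a locked-image setup, the image embeddings would come from a frozen pre-trained image model, and only the text model producing the text embeddings would be updated.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(img_embs, txt_embs, temperature=0.1):
    # Entry [i][j] is the temperature-scaled cosine similarity between
    # image i and text j. The temperature value is an assumption.
    imgs = [l2_normalize(v) for v in img_embs]
    txts = [l2_normalize(v) for v in txt_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(img_embs, txt_embs, temperature=0.1):
    # Assumed symmetric loss: the i-th image and i-th text form the positive
    # pair; every other pairing in the batch acts as a negative.
    # img_embs are treated as fixed outputs of a locked image model; in
    # training, only the text model behind txt_embs would receive gradients.
    logits = similarity_matrix(img_embs, txt_embs, temperature)
    n = len(logits)
    img_to_txt = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    txt_to_img = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)

def zero_shot_classify(img_emb, class_txt_embs):
    # Return the index of the class whose text embedding is most similar
    # to the image embedding.
    scores = similarity_matrix([img_emb], class_txt_embs)[0]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (hypothetical): matched pairs should give a lower loss
# than mismatched pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
mismatched_texts = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(images, matched_texts) < contrastive_loss(images, mismatched_texts))
print(zero_shot_classify([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0]]))
```

Zero-shot transfer in this sketch needs no task-specific training: class names are embedded by the text model and each image is assigned to the closest class embedding.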
