Publication | Open Access
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Citations: 905
References: 0
Year: 2023
Keywords: Engineering, Machine Learning, Frozen Image Encoder, Frozen Language Model, Bootstrapping Language-Image Pre-training, Large Language Model, Large Language Models, Natural Language Processing, Multimodal LLM, Image Analysis, Zero-shot Learning, Visual Grounding, Modality Gap, Video Transformer, Machine Translation, Large AI Model, Machine Vision, Vision Language Model, Computer Science, Deep Learning, Computer Vision, Frozen Image Encoders, Linguistics
End-to-end training of large-scale models has made vision-and-language pre-training increasingly expensive. This work introduces BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language learning from frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the two with a lightweight Querying Transformer pre-trained in two stages: the first aligns the frozen image encoder's representations with language, and the second trains vision-to-language generation against a frozen language model. BLIP-2 attains state-of-the-art results on multiple vision-language benchmarks with far fewer trainable parameters than existing methods; for example, it outperforms Flamingo-80B by 8.7% on zero-shot VQAv2 with 54× fewer trainable parameters. It also enables zero-shot image-to-text generation that follows natural language instructions.
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo-80B by 8.7% on zero-shot VQAv2 with 54× fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
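The saving in trainable parameters follows from simple accounting: when the image encoder and the language model are frozen, only the bridging module receives gradient updates. The sketch below illustrates that arithmetic in plain Python. The `Component` class, the helper function, and every parameter count are hypothetical placeholders chosen for readability; none of them is a figure from the paper or part of any BLIP-2 codebase.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    num_params: int  # illustrative placeholder, NOT a figure from the paper
    frozen: bool     # frozen components receive no gradient updates


def trainable_params(components):
    """Sum the parameters of the components that are actually trained."""
    return sum(c.num_params for c in components if not c.frozen)


# Hypothetical sizes, picked only to make the arithmetic easy to follow.
pipeline = [
    Component("image_encoder", 1_000_000_000, frozen=True),
    Component("querying_transformer", 100_000_000, frozen=False),
    Component("language_model", 3_000_000_000, frozen=True),
]

# Baseline for comparison: the same components, all trained end to end.
end_to_end = [Component(c.name, c.num_params, frozen=False) for c in pipeline]

# How many times fewer parameters are trained when the large parts are frozen.
ratio = trainable_params(end_to_end) / trainable_params(pipeline)
print(f"trainable (frozen setup): {trainable_params(pipeline):,}")
print(f"trainable (end to end):   {trainable_params(end_to_end):,}")
print(f"reduction factor:         {ratio:.1f}x")
```

With these made-up sizes the frozen setup trains 100 million parameters instead of 4.1 billion. The 54× figure quoted above compares BLIP-2 against a different model (Flamingo-80B), so it cannot be reproduced from this toy accounting.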