BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

TLDR

Vision-language pre-training has improved many tasks, yet most models excel at either understanding or generation, not both, and progress has largely come from scaling up noisy web image-text pairs, an imperfect source of supervision. We introduce BLIP, a VLP framework that transfers flexibly to both vision-language understanding and generation tasks. BLIP bootstraps noisy web data: a captioner generates synthetic captions and a filter removes the noisy ones. BLIP attains state-of-the-art performance on image-text retrieval, captioning, and VQA, and generalizes zero-shot to video-language tasks. Code, models, and datasets are publicly released.

Abstract

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks. However, most existing pre-trained models excel only in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework that transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.
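
The caption bootstrapping step described above can be illustrated with a minimal sketch. This is not the released implementation: the function name `bootstrap_captions`, the `captioner` and `is_match` callables, and the toy stand-ins are all hypothetical, and the sketch assumes the filter is applied to both the original web caption and the synthetic one. In practice both callables would be learned models rather than string rules.

```python
from typing import Callable, Iterable, List, Tuple

# An (image, caption) pair; the image is a plain string id in this toy sketch.
Pair = Tuple[str, str]


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    is_match: Callable[[str, str], bool],
) -> List[Pair]:
    """Hypothetical sketch: add synthetic captions, then drop noisy ones."""
    kept: List[Pair] = []
    for image, web_caption in web_pairs:
        # Candidates: the original web caption plus one synthetic caption.
        candidates = [web_caption, captioner(image)]
        for caption in candidates:
            # Assumption: the filter judges every candidate, web or synthetic.
            if is_match(image, caption):
                kept.append((image, caption))
    return kept


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_captioner(image: str) -> str:
    return f"a photo of {image}"


def toy_filter(image: str, caption: str) -> bool:
    # Treat a caption as matching if it mentions the image id.
    return image in caption


web_pairs = [
    ("dog", "buy cheap tickets now"),  # noisy web text, unrelated to the image
    ("cat", "my cat on the sofa"),     # web text that matches the image
]
cleaned = bootstrap_captions(web_pairs, toy_captioner, toy_filter)
for pair in cleaned:
    print(pair)
```

In this toy run the unrelated web caption for "dog" is dropped while its synthetic caption is kept, which mirrors the idea that the bootstrapped dataset replaces noisy supervision with cleaner text.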