CLIPScore: A Reference-free Evaluation Metric for Image Captioning

TLDR

Image captioning has traditionally used reference-based automatic evaluations, whereas humans assess caption quality in a reference-free manner. The paper introduces CLIPScore, a reference-free evaluation metric for image captioning, and a reference-augmented variant, RefCLIPScore. CLIPScore uses the pretrained CLIP model to assess image-text compatibility, while RefCLIPScore additionally draws on reference captions. Across several corpora, CLIPScore achieves the highest correlation with human judgements, outperforming reference-based metrics like CIDEr and SPICE, and RefCLIPScore achieves even higher correlation.

Abstract

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.
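The abstract does not spell out how the scores are computed, so the following is only a minimal sketch under stated assumptions. It assumes the formulation as we understand it from the full paper: CLIPScore is a rescaled, clipped cosine similarity between CLIP's caption and image embeddings (with a rescaling weight of 2.5), and RefCLIPScore is the harmonic mean of CLIPScore and the best clipped caption-reference cosine similarity. The function names and the toy vectors standing in for CLIP encoder outputs are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def clip_score(caption_emb: Sequence[float],
               image_emb: Sequence[float],
               w: float = 2.5) -> float:
    """Reference-free score: w * max(cos(caption, image), 0).

    w = 2.5 is an assumption taken from the full paper, not from the abstract.
    """
    return w * max(cosine(caption_emb, image_emb), 0.0)


def ref_clip_score(caption_emb: Sequence[float],
                   image_emb: Sequence[float],
                   ref_embs: Sequence[Sequence[float]],
                   w: float = 2.5) -> float:
    """Reference-augmented score: harmonic mean of the reference-free score
    and the best clipped caption-reference cosine similarity (assumed form)."""
    image_term = clip_score(caption_emb, image_emb, w)
    ref_term = max(max(cosine(caption_emb, r) for r in ref_embs), 0.0)
    if image_term == 0.0 or ref_term == 0.0:
        return 0.0  # the harmonic mean is zero when either term is zero
    return 2.0 * image_term * ref_term / (image_term + ref_term)


if __name__ == "__main__":
    # Toy 2-d vectors standing in for CLIP text/image embeddings.
    caption = [1.0, 0.0]
    image = [1.0, 1.0]
    references = [[1.0, 0.0], [0.0, 1.0]]
    print(clip_score(caption, image))
    print(ref_clip_score(caption, image, references))
```

In practice the embeddings would come from a pretrained CLIP text encoder and image encoder; the sketch takes them as plain vectors so that the scoring step can be read in isolation.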
