VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

TLDR

The paper introduces VSE++, a technique for learning visual-semantic embeddings for cross-modal retrieval. VSE++ makes a simple change to the loss functions commonly used for multi-modal embeddings by incorporating hard negatives, and is evaluated on MS-COCO and Flickr30K with ablation studies and comparisons to prior methods. Combined with fine-tuning and augmented data, the change yields significant retrieval gains: on MS-COCO the method outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).

Abstract

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
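
The abstract does not spell out the loss itself, so the following is only a minimal sketch of one reading of the "simple change": a standard hinge-based ranking loss sums the margin violations over all negatives in a mini-batch, whereas a hard-negative variant keeps only the largest violation per query. The function name `hinge_ranking_losses`, the margin value of 0.2, and the plain nested-list similarity matrix are illustrative assumptions, not details taken from the text above.

```python
def hinge_ranking_losses(scores, margin=0.2):
    """Compare sum-of-hinges and max-of-hinges ranking losses (illustrative sketch).

    scores[i][j] is the similarity between image i and caption j;
    diagonal entries scores[i][i] are the matched (positive) pairs.
    The margin default is an assumed, illustrative value.
    Returns (sum_of_hinges, max_of_hinges) over the whole batch.
    """
    n = len(scores)
    sum_loss = 0.0
    max_loss = 0.0
    for i in range(n):
        pos = scores[i][i]
        # Caption retrieval: image i is the query, captions j != i are negatives.
        cap_costs = [max(0.0, margin + scores[i][j] - pos)
                     for j in range(n) if j != i]
        # Image retrieval: caption i is the query, images j != i are negatives.
        img_costs = [max(0.0, margin + scores[j][i] - pos)
                     for j in range(n) if j != i]
        # Standard loss: every violating negative contributes.
        sum_loss += sum(cap_costs) + sum(img_costs)
        # Hard-negative loss: only the worst violator per query contributes.
        max_loss += max(cap_costs, default=0.0) + max(img_costs, default=0.0)
    return sum_loss, max_loss


if __name__ == "__main__":
    # Toy 3x3 similarity matrix (made-up numbers, for illustration only).
    toy_scores = [
        [0.9, 0.8, 0.1],
        [0.2, 0.7, 0.3],
        [0.4, 0.6, 0.5],
    ]
    total_sum, total_max = hinge_ranking_losses(toy_scores)
    print(total_sum, total_max)
```

Because each per-query maximum is one of the non-negative terms in the corresponding sum, the hard-negative loss never exceeds the summed loss on the same batch; what changes is that the gradient comes from the single hardest negative per query instead of being spread over every violating negative.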
