Publication | Open Access
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
Citations: 579
References: 25
Year: 2017
Engineering, Machine Learning, Ranking Loss Functions, Image Retrieval, Hard Negatives, Multimodal Learning, Image Search, Natural Language Processing, Image Analysis, Information Retrieval, Data Science, Visual Grounding, Pattern Recognition, Visual Question Answering, Machine Vision, Vision Language Model, Computer Science, Cross-modal Retrieval, Hard Negative Mining, Deep Learning, Computer Vision
The paper introduces VSE++, a new technique for learning visual-semantic embeddings to improve cross-modal retrieval. VSE++ modifies standard multi-modal loss functions by incorporating hard negatives, and is evaluated on MS-COCO and Flickr30K with ablation studies and comparisons to prior methods. The method achieves significant retrieval gains, outperforming state-of-the-art methods on MS-COCO by 8.8% in caption retrieval and 11.3% in image retrieval (both at R@1).
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
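The abstract does not spell out the modified loss, so the following is only a minimal sketch of one way a hard-negative variant of a hinge-based ranking loss can be written. It assumes a square similarity matrix whose diagonal holds the matched image-caption pairs, and it contrasts summing the hinge cost over all negatives in a batch with keeping only the hardest (highest-cost) negative. The function name `ranking_losses`, the margin value, and the toy scores are illustrative assumptions, not values taken from the paper.

```python
def ranking_losses(scores, margin=0.2):
    """Return (sum_of_hinges, max_of_hinges) for a batch.

    Assumption: scores[i][j] is the similarity between image i and
    caption j, and scores[i][i] is the matched (positive) pair.
    Requires at least two pairs so that every item has a negative.
    """
    n = len(scores)
    sum_loss = 0.0
    max_loss = 0.0
    for i in range(n):
        pos = scores[i][i]
        # Hinge cost of every non-matching caption for image i.
        cap_costs = [max(0.0, margin + scores[i][j] - pos)
                     for j in range(n) if j != i]
        # Hinge cost of every non-matching image for caption i.
        img_costs = [max(0.0, margin + scores[j][i] - pos)
                     for j in range(n) if j != i]
        # Conventional variant: every violating negative contributes.
        sum_loss += sum(cap_costs) + sum(img_costs)
        # Hard-negative variant: only the worst violator contributes.
        max_loss += max(cap_costs) + max(img_costs)
    return sum_loss, max_loss


if __name__ == "__main__":
    # Toy similarity matrix (hypothetical values, 3 images x 3 captions).
    toy_scores = [
        [0.50, 0.60, 0.45],
        [0.10, 0.90, 0.20],
        [0.30, 0.20, 0.80],
    ]
    sum_h, max_h = ranking_losses(toy_scores, margin=0.2)
    print("sum of hinges:", sum_h)
    print("max of hinges:", max_h)
```

In this sketch the hard-negative variant is never larger than the summed variant, because it keeps a single term per direction instead of all of them. The intended effect of such a change is that the gradient concentrates on the negative most likely to be ranked above the correct match, rather than being spread over many easy negatives.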