Publication | Closed Access
Fine-grained Image-text Matching by Cross-modal Hard Aligning Network
Citations: 95
References: 42
Year: 2023
Unknown Venue
Engineering, Machine Learning, Natural Language Processing, Multimodal Llm, Image Analysis, Text-to-image Retrieval, Visual Grounding, Pattern Recognition, Text Recognition, Visual Question Answering, Redundant Alignments, Fine-grained Image-text Matching, Machine Translation, Machine Vision, Vision Language Model, Image Similarity, Deep Learning, Computer Vision, Meaningful Alignments, Other Alignments
Current state-of-the-art image-text matching methods implicitly align visual-semantic fragments, such as regions in images and words in sentences, and adopt a cross-attention mechanism to discover fine-grained cross-modal semantic correspondence. However, cross-attention may introduce redundant or irrelevant region-word alignments, which degrades retrieval accuracy and limits efficiency. Although many researchers have made progress in mining meaningful alignments and thereby improving accuracy, the problem of poor efficiency remains unresolved. In this work, we propose to learn fine-grained image-text matching from the perspective of information coding. Specifically, we suggest a coding framework to explain the fragment alignment process, which provides a novel view from which to reexamine the cross-attention mechanism and analyze the problem of redundant alignments. Based on this framework, we design a Cross-modal Hard Aligning Network (CHAN), which comprehensively exploits the most relevant region-word pairs and eliminates all other alignments. Extensive experiments on two public datasets, MS-COCO and Flickr30K, verify that the relevance of the most associated word-region pairs is discriminative enough to serve as an indicator of image-text similarity, and that CHAN achieves superior accuracy and efficiency over state-of-the-art approaches on bidirectional image and text retrieval. Our code will be available at https://github.com/ppanzx/CHAN.
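The contrast between soft cross-attention and hard alignment can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`hard_align_similarity`, `soft_align_similarity`), the use of cosine similarity, the `temperature` parameter, and the mean aggregation over words are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows the general idea of keeping the single most relevant region per word and discarding every other alignment.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hard_align_similarity(regions, words):
    """Hypothetical hard alignment: each word keeps only its best region.

    Assumption: the per-word maxima are averaged into one image-text score.
    """
    if not regions or not words:
        raise ValueError("regions and words must be non-empty")
    best = [max(cosine(w, r) for r in regions) for w in words]
    return sum(best) / len(best)


def soft_align_similarity(regions, words, temperature=0.1):
    """Hypothetical soft baseline: every region contributes to every word.

    Each word attends to all regions with softmax weights over the
    temperature-scaled similarities, so irrelevant regions still leak in.
    """
    if not regions or not words:
        raise ValueError("regions and words must be non-empty")
    scores = []
    for w in words:
        sims = [cosine(w, r) for r in regions]
        peak = max(sims)
        # Subtract the peak before exponentiating for numerical stability.
        exps = [math.exp((s - peak) / temperature) for s in sims]
        total = sum(exps)
        scores.append(sum(e / total * s for e, s in zip(exps, sims)))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy 2-D features; real region and word embeddings are much larger.
    regions = [[1.0, 0.0], [0.0, 1.0]]
    words = [[1.0, 0.0], [1.0, 1.0]]
    print("hard:", hard_align_similarity(regions, words))
    print("soft:", soft_align_similarity(regions, words))
```

In this sketch the hard score is never lower than the soft score, because a maximum is at least as large as any weighted average of the same values. The hard variant also needs no softmax or weighted sum once the similarity matrix is computed, which is one plausible reading of the efficiency argument in the abstract.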