Publication | Closed Access
Reflective Decoding Network for Image Captioning
Citations: 102
References: 50
Year: 2019
Unknown Venue
Engineering, Machine Learning, Corpus Linguistics, Reflective Decoding Network, Natural Language Processing, Multimodal LLM, Text-to-image Retrieval, Visual Grounding, Computational Linguistics, Visual Question Answering, Vocabulary Coherence, Language Studies, Machine Translation, Vision Language Model, Relative Position, Deep Learning, Image Captioning, Computer Vision, Linguistics
Current image captioning methods focus mainly on visual features, neglecting language properties that could improve performance. The study demonstrates that language coherence and syntax are crucial for high‑quality captions and introduces the Reflective Decoding Network to enhance long‑sequence dependency and word position perception. RDN extends the encoder‑decoder architecture by jointly attending to visual and textual cues while modeling each word’s relative position to strengthen caption generation. RDN outperforms prior methods on COCO and especially improves caption quality for complex scenes.
State-of-the-art image captioning methods mostly focus on improving visual features; less attention has been paid to utilizing the inherent properties of language to boost captioning performance. In this paper, we show that vocabulary coherence between words and the syntactic paradigm of sentences are also important for generating high-quality image captions. Following the conventional encoder-decoder framework, we propose the Reflective Decoding Network (RDN) for image captioning, which enhances both the long-sequence dependency and position perception of words in a caption decoder. Our model learns to collaboratively attend to both visual and textual features while perceiving each word's relative position in the sentence, so as to maximize the information delivered in the generated caption. We evaluate the effectiveness of our RDN on the COCO image captioning dataset and achieve superior performance over previous methods. Further experiments reveal that our approach is particularly advantageous for hard cases in which complex scenes must be described by captions.
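The idea of attending jointly to visual and textual features, together with a per-word relative position, can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the scoring function or how the two contexts are fused, so the dot-product attention, the concatenation step, and every name below (`attend`, `decode_step_context`, `relative_position`) are assumptions chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Dot-product attention (an assumed scoring function).

    Returns the attention weights and the weighted sum of `keys`.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context


def relative_position(step, length):
    """Relative position of the word at 1-based `step` in a caption of `length` words."""
    return step / length


def decode_step_context(query, visual_feats, past_hidden):
    """Build one decoding step's context (hypothetical fusion by concatenation).

    visual_feats: image region features (visual cue).
    past_hidden:  earlier decoder states (textual cue, long-range dependency).
    """
    _, visual_ctx = attend(query, visual_feats)
    _, textual_ctx = attend(query, past_hidden)
    return visual_ctx + textual_ctx  # list concatenation, not element-wise addition


# Toy usage: two image regions, one previous decoder state, all of dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
past = [[0.5, 0.5]]
context = decode_step_context([1.0, 0.0], visual, past)
```

In this sketch the textual context lets the current step look back at all earlier decoder states rather than only the most recent one, which is one plausible reading of "long-sequence dependency"; the relative position `step / length` is the simplest quantity a position-aware decoder could be trained to predict.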