Publication | Closed Access
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
Citations: 1.7K
References: 35
Year: 2017
Unknown Venue
Engineering, Machine Learning, Speech Recognition, Natural Language Processing, Multimodal LLM, Image Analysis, Text-to-image Retrieval, Visual Grounding, Visual Sentinel, Visual Question Answering, Machine Translation, Machine Vision, Visual Attention, Vision Language Model, Deep Learning, Image Captioning, Computer Vision, Adaptive Attention, Visual Information
Attention-based encoder-decoder frameworks dominate image captioning, yet most methods force visual attention for every word, even though non-visual words can often be generated from the language model alone. This work introduces an adaptive attention model that uses a visual sentinel to attend to the image selectively. At each decoding step, the model chooses either to attend to the image (and to which regions) or to rely on the visual sentinel. The model is evaluated on the COCO image captioning 2015 challenge dataset and Flickr30K, where it sets a new state of the art by a significant margin.
Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of". Other words that may seem visual can often be predicted reliably just from the language model, e.g., "sign" after "behind a red stop" or "phone" following "talking on a cell". In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.
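
The per-step choice between image regions and the visual sentinel can be illustrated with a small NumPy sketch. The abstract does not give the scoring function or parameter shapes, so everything here is an assumption made for illustration: the additive (tanh) scoring form, the hypothetical parameter names `W_v`, `W_s`, `W_g`, `w_h`, and the random inputs. The idea shown is one plausible reading: score each region and the sentinel against the decoder state, apply a single softmax over all of them, and treat the sentinel's weight `beta` as how strongly the model relies on language rather than the image at that step.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_context(regions, sentinel, hidden, W_v, W_s, W_g, w_h):
    """Mix k image-region features with a sentinel vector (illustrative only).

    regions:  (k, d) region features
    sentinel: (d,)   visual sentinel vector
    hidden:   (d,)   current decoder state
    W_v, W_s, W_g: (a, d) projections; w_h: (a,) scoring vector (all hypothetical)
    """
    h_proj = W_g @ hidden                                     # (a,)
    region_scores = np.tanh(regions @ W_v.T + h_proj) @ w_h   # (k,)
    sentinel_score = np.tanh(W_s @ sentinel + h_proj) @ w_h   # scalar
    # One softmax over k regions plus the sentinel: the weights compete.
    weights = softmax(np.append(region_scores, sentinel_score))
    alpha, beta = weights[:-1], weights[-1]
    # beta near 1: rely on the sentinel; beta near 0: rely on the image.
    context = alpha @ regions + beta * sentinel               # (d,)
    return context, alpha, beta

# Toy usage with random values (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
k, d, a = 4, 8, 6
regions = rng.normal(size=(k, d))
sentinel = rng.normal(size=d)
hidden = rng.normal(size=d)
W_v, W_s, W_g = (rng.normal(size=(a, d)) for _ in range(3))
w_h = rng.normal(size=a)

context, alpha, beta = adaptive_context(regions, sentinel, hidden, W_v, W_s, W_g, w_h)
print("region weights:", alpha, "sentinel weight:", beta)
```

Because the region weights and the sentinel weight come from one softmax, they sum to 1, so attending more to the sentinel necessarily means attending less to the image. How the sentinel vector itself is produced is not covered in this sketch.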