Image Captioning with Semantic Attention

TLDR

Automatic image captioning links computer vision and natural language processing. Existing methods are either top-down or bottom-up; the authors introduce a semantic-attention model that combines the two. The model selectively attends to semantic concept proposals and fuses them into the hidden states and outputs of a recurrent neural network, and this selection and fusion form a feedback loop connecting the top-down and bottom-up computations. On Microsoft COCO and Flickr30K, the authors report that the algorithm significantly outperforms state-of-the-art approaches across multiple evaluation metrics.

Abstract

Automatically generating a natural language description of an image has attracted interest recently, both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention. Our algorithm learns to selectively attend to semantic concept proposals and fuse them into hidden states and outputs of recurrent neural networks. The selection and fusion form a feedback loop connecting the top-down and bottom-up computation. We evaluate our algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental results show that our algorithm significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
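
To make the idea of attending to concept proposals and fusing them into a recurrent state more concrete, here is a minimal sketch in plain Python. It is not the authors' model. The dot-product scoring, the `fuse` mixing rule, the `mix` parameter, and the toy concept vectors are all assumptions chosen for illustration. The abstract does not specify how attention scores or the fusion are computed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(hidden, concept_vectors):
    """Score each concept against the hidden state (assumed: dot product),
    normalize with softmax, and return the weights and the weighted sum."""
    weights = softmax([dot(hidden, c) for c in concept_vectors])
    dim = len(hidden)
    context = [sum(w * c[i] for w, c in zip(weights, concept_vectors))
               for i in range(dim)]
    return weights, context

def fuse(hidden, context, mix=0.5):
    """Hypothetical fusion: tanh of a convex mix of state and attended context."""
    return [math.tanh((1 - mix) * h + mix * c) for h, c in zip(hidden, context)]

# Toy bottom-up concept proposals (illustrative one-hot embeddings).
concepts = {
    "dog": [1.0, 0.0, 0.0],
    "frisbee": [0.0, 1.0, 0.0],
    "grass": [0.0, 0.0, 1.0],
}
# Toy top-down state, standing in for an RNN hidden state.
hidden = [2.0, 0.5, 0.1]

weights, context = attend(hidden, list(concepts.values()))
new_hidden = fuse(hidden, context)  # feeds the next step's attention
```

In this sketch the feedback is that `new_hidden`, which already mixes in the attended concepts, would drive the attention weights at the next decoding step.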
