Boosting Image Captioning with Attributes

TLDR

Automatically describing images in natural language is an emerging challenge in computer vision and natural language processing. The paper proposes LSTM-A (Long Short-Term Memory with Attributes), an architecture that integrates attributes into the CNN-plus-RNN image captioning framework and trains them end to end. Attribute learning uses Multiple Instance Learning strengthened with inter-attribute correlations; image representations and attributes are then fed into the RNN in several different ways to explore the relationship between them. On the COCO dataset, LSTM-A shows clear improvements over state-of-the-art deep models, reaching METEOR/CIDEr-D of 25.5%/100.2% on the test data of the splits in [10] with GoogLeNet image representations, and achieves superior performance on the COCO captioning leaderboard.

Abstract

Automatically describing an image in natural language has been an emerging challenge in both computer vision and natural language processing. In this paper, we present Long Short-Term Memory with Attributes (LSTM-A), a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework by training them in an end-to-end manner. In particular, the learning of attributes is strengthened by integrating inter-attribute correlations into Multiple Instance Learning (MIL). To incorporate attributes into captioning, we construct variants of the architecture by feeding image representations and attributes into RNNs in different ways to explore the mutual but also fuzzy relationship between them. Extensive experiments are conducted on the COCO image captioning dataset, and our framework shows clear improvements when compared to state-of-the-art deep models. More remarkably, we obtain METEOR/CIDEr-D of 25.5%/100.2% on the testing data of the widely used and publicly available splits in [10] when extracting image representations with GoogLeNet, and achieve superior performance on the COCO captioning Leaderboard.
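
The abstract does not spell out the MIL formulation. As a rough illustration only, the sketch below assumes the common noisy-OR form of MIL, in which an image contains an attribute if at least one of its regions does. The function names, the input format, and the 0.5 threshold are hypothetical, and the inter-attribute correlation term that the paper adds is deliberately left out.

```python
from math import prod


def noisy_or(region_probs):
    """Image-level probability that an attribute is present.

    Assumes the noisy-OR form of MIL: the image has the attribute
    unless every region independently fails to show it.
    """
    return 1.0 - prod(1.0 - p for p in region_probs)


def detect_attributes(region_scores, threshold=0.5):
    """Keep attributes whose image-level probability reaches the threshold.

    region_scores maps an attribute name to a list of per-region
    probabilities (a hypothetical input format for this sketch).
    """
    image_probs = {attr: noisy_or(ps) for attr, ps in region_scores.items()}
    return {attr: p for attr, p in image_probs.items() if p >= threshold}


if __name__ == "__main__":
    scores = {"dog": [0.9, 0.1], "car": [0.1, 0.1]}
    print(detect_attributes(scores))
```

In a captioning model along the lines described above, the resulting attribute probabilities would be passed to the RNN alongside the CNN image representation; the order and time step at which each is fed is what distinguishes the architecture variants.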
