Deep Reinforcement Learning-Based Image Captioning with Embedding Reward

TLDR

Image captioning remains challenging due to the need to understand complex visual content and generate diverse natural language descriptions, though recent deep neural network advances and encoder‑decoder frameworks have substantially improved performance. This work proposes a novel decision‑making framework for image captioning. The framework employs a policy network that guides word selection by estimating next‑word confidence and a value network that evaluates all possible caption continuations to steer generation toward ground‑truth similarity, with both networks trained via actor‑critic reinforcement learning using a visual‑semantic embedding reward. Experiments on the Microsoft COCO dataset demonstrate that the proposed method surpasses state‑of‑the‑art approaches across multiple evaluation metrics.

Abstract

Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. However, in this paper, we introduce a novel decision-making framework for image captioning. We utilize a policy network and a value network to collaboratively generate captions. The policy network serves as a local guidance by providing the confidence of predicting the next word according to the current state. Additionally, the value network serves as a global and lookahead guidance by evaluating all possible extensions of the current state. In essence, it adjusts the goal of predicting the correct words towards the goal of generating captions similar to the ground truth captions. We train both networks using an actor-critic reinforcement learning model, with a novel reward defined by visual-semantic embedding. Extensive experiments and analyses on the Microsoft COCO dataset show that the proposed framework outperforms state-of-the-art approaches across different evaluation metrics.

References

Page 1

	Year	Citations

Page 1