Multi-Level Policy and Reward-Based Deep Reinforcement Learning Framework for Image Captioning

TLDR

Image captioning is a challenging AI task that requires understanding of complex visuals and natural language, and recent work has applied reinforcement learning to improve word‑by‑word generation, but existing RL approaches rely on a single policy and reward that do not capture the task’s multi‑level and multi‑modal nature. To address this gap, we propose a novel multi‑level policy and reward reinforcement learning framework that can be integrated with RNN‑based captioning models, language metrics, or visual‑semantic functions. The framework comprises a multi‑level policy network that jointly updates word‑ and sentence‑level policies, a multi‑level reward function that combines vision‑language and language‑language rewards, and a guidance term that bridges policy and reward during RL optimization. Experiments on MSCOCO and Flickr30k show that the framework achieves competitive performance across multiple evaluation metrics, and ablation studies confirm its generalization across different captioning models and metrics.

Abstract

Image captioning is one of the most challenging tasks in AI because it requires an understanding of both complex visuals and natural language. Because image captioning is essentially a sequential prediction task, recent advances in image captioning have used reinforcement learning (RL) to better explore the dynamics of word-by-word generation. However, the existing RL-based image captioning methods rely primarily on a single policy network and reward function-an approach that is not well matched to the multi-level (word and sentence) and multi-modal (vision and language) nature of the task. To solve this problem, we propose a novel multi-level policy and reward RL framework for image captioning that can be easily integrated with RNN-based captioning models, language metrics, or visual-semantic functions for optimization. Specifically, the proposed framework includes two modules: 1) a multi-level policy network that jointly updates the word- and sentence-level policies for word generation; and 2) a multi-level reward function that collaboratively leverages both a vision-language reward and a language-language reward to guide the policy. Furthermore, we propose a guidance term to bridge the policy and the reward for RL optimization. The extensive experiments on the MSCOCO and Flickr30k datasets and the analyses show that the proposed framework achieves competitive performances on a variety of evaluation metrics. In addition, we conduct ablation studies on multiple variants of the proposed framework and explore several representative image captioning models and metrics for the word-level policy network and the language-language reward function to evaluate the generalization ability of the proposed framework.

References

Page 1

	Year	Citations

Page 1