Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions

TLDR

Current image captioning methods rely on black-box models that offer limited controllability and explainability, yet the right caption for an image varies widely with the goal and context, which calls for more controllable approaches. This work proposes a framework that generates diverse, controllable image captions by explicitly grounding textual chunks on specified image regions. Given a control signal in the form of a sequence or set of image regions, a recurrent network predicts caption chunks grounded on those regions while respecting the constraints of the control signal. The framework is evaluated on Flickr30k Entities and on COCO Entities, an extended version of COCO with grounding annotations collected semi-automatically. The method achieves state-of-the-art results on controllable captioning in terms of both caption quality and diversity. Code and annotations are available at https://github.com/aimagelab/show-control-and-tell.

Abstract

Current captioning approaches can describe images using black-box architectures whose behavior is hard to control and explain from the outside. As an image can be described in infinitely many ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO to which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state-of-the-art performance on controllable image captioning, in terms of caption quality and diversity. Code and annotations are publicly available at: https://github.com/aimagelab/show-control-and-tell.
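
The control flow described in the abstract (walk through a sequence of regions, emit a textual chunk for each, and move on when the chunk ends) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Region`, `generate_controlled_caption`, `toy_step`, and `TOY_CHUNKS` are hypothetical, and the lookup table stands in for the learned recurrent model. The sketch covers only the case where the control signal is an ordered sequence of regions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a detected image region."""
    label: str


# A step function returns the next word and whether the current chunk ends.
# In a real system this would be a learned recurrent model; here it is a toy.
StepFn = Callable[[Region, List[str]], Tuple[str, bool]]


def generate_controlled_caption(
    control: List[Region], step: StepFn, max_words: int = 20
) -> List[Tuple[str, str]]:
    """Emit (word, region label) pairs, following the order of `control`."""
    caption: List[Tuple[str, str]] = []
    chunk: List[str] = []  # words emitted so far for the current region
    idx = 0                # position in the control sequence
    while idx < len(control) and len(caption) < max_words:
        region = control[idx]
        word, end_of_chunk = step(region, chunk)
        caption.append((word, region.label))  # each word is tied to a region
        chunk.append(word)
        if end_of_chunk:
            idx += 1       # shift to the next region in the control signal
            chunk = []
    return caption


# Assumption: a fixed chunk per region label, replacing the learned model.
TOY_CHUNKS: Dict[str, List[str]] = {
    "dog": ["a", "dog"],
    "frisbee": ["catching", "a", "frisbee"],
    "park": ["in", "the", "park"],
}


def toy_step(region: Region, chunk_so_far: List[str]) -> Tuple[str, bool]:
    words = TOY_CHUNKS[region.label]
    pos = len(chunk_so_far)
    return words[pos], pos == len(words) - 1


if __name__ == "__main__":
    control = [Region("dog"), Region("frisbee"), Region("park")]
    pairs = generate_controlled_caption(control, toy_step)
    print(" ".join(word for word, _ in pairs))
```

Passing the same regions in a different order yields a different caption, which is the sense in which the region sequence controls the output in this toy setting; every emitted word also carries the label of the region it was generated for.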
