Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning

Abstract

Existing research on visual captioning usually employs a CNN-RNN architecture that combines a CNN for image encoding with an RNN for caption generation, where the vocabulary built from the entire training dataset serves as the decoding space. Such approaches typically suffer from generating n-grams that occur frequently in the training set but are irrelevant to the given image. To tackle this problem, we propose to construct an image-grounded vocabulary that leverages image semantics for more effective caption generation. More concretely, we propose a two-step approach that constructs the vocabulary by incorporating both visual information and relationships among words. We then explore two strategies for using the constructed vocabulary in caption generation: one constrains the generator to select words only from the image-grounded vocabulary, and the other integrates the vocabulary information into the RNN cell during caption generation. Experimental results on two public datasets show the effectiveness of our framework compared to state-of-the-art models. Our code is available on GitHub.
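The first strategy, restricting the generator to the image-grounded vocabulary, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vocabulary, the scores, and the choice of greedy decoding are all assumptions made for illustration. The sketch masks out every word outside a given per-image word set before picking the next word.

```python
import math

def constrained_greedy_step(logits, allowed_ids):
    """Return the index of the best-scoring word among allowed_ids only.

    logits      -- list of decoder scores, one per word in the full vocabulary
    allowed_ids -- indices forming the (hypothetical) image-grounded vocabulary
    """
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("image-grounded vocabulary must not be empty")
    # Words outside the grounded vocabulary get a score of -inf.
    masked = [s if i in allowed else -math.inf for i, s in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

def masked_softmax(logits, allowed_ids):
    """Softmax renormalised over allowed_ids; all other words get probability 0."""
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("image-grounded vocabulary must not be empty")
    peak = max(logits[i] for i in allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if i in allowed else 0.0
            for i, s in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data (assumed, not from the paper).
vocab = ["a", "dog", "cat", "on", "the", "beach"]
logits = [1.0, 0.5, 2.0, 0.1, 0.3, 0.2]   # "cat" scores highest overall
grounded = [0, 1, 5]                       # words judged relevant to the image

unconstrained = max(range(len(logits)), key=logits.__getitem__)
constrained = constrained_greedy_step(logits, grounded)
probs = masked_softmax(logits, grounded)
```

In this toy case the unconstrained decoder would pick "cat", a word that is frequent in the scores but absent from the grounded set, while the constrained step can only choose among "a", "dog", and "beach". The second strategy, feeding vocabulary information into the RNN cell, is a soft alternative that this sketch does not cover.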
