Publication | Open Access

Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning

Citations: 24

References: 37

Year: 2019

Abstract

Existing research on visual captioning usually employs a CNN-RNN architecture that combines a CNN for image encoding with an RNN for caption generation, where the vocabulary constructed from the entire training dataset serves as the decoding space. Such approaches typically suffer from generating n-grams that occur frequently in the training set but are irrelevant to the given image. To tackle this problem, we propose to construct an image-grounded vocabulary that leverages image semantics for more effective caption generation. More concretely, a two-step approach is proposed to construct the vocabulary by incorporating both visual information and relationships among words. Two strategies are then explored to utilize the constructed vocabulary for caption generation. One constrains the generator to select words from the image-grounded vocabulary only, and the other integrates the vocabulary information into the RNN cell during the caption generation process. Experimental results on two public datasets show the effectiveness of our framework compared to state-of-the-art models. Our code is available on GitHub.
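The first strategy, restricting the generator to words from the image-grounded vocabulary, can be illustrated with a minimal sketch of one constrained decoding step. This is not the authors' implementation: the function names, the toy vocabulary, the logits, and the set of allowed word indices are all hypothetical, and the two-step vocabulary construction itself is not shown because the abstract does not describe it in detail. The sketch only assumes that some image-grounded subset of the full vocabulary is already available.

```python
import math


def constrained_softmax(logits, allowed_ids):
    """Softmax over `logits`, with zero probability outside `allowed_ids`.

    `allowed_ids` stands in for an image-grounded vocabulary (hypothetical
    input here; how it is built is outside the scope of this sketch).
    """
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("image-grounded vocabulary must not be empty")
    # Subtract the largest allowed logit for numerical stability.
    max_logit = max(logits[i] for i in allowed)
    exps = [
        math.exp(x - max_logit) if i in allowed else 0.0
        for i, x in enumerate(logits)
    ]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_step(logits, allowed_ids):
    """Pick the most probable word index among the allowed words."""
    probs = constrained_softmax(logits, allowed_ids)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy data (all values are made up for illustration).
vocab = ["a", "dog", "cat", "on", "beach", "street"]
logits = [0.2, 1.4, 2.0, 0.3, 0.1, 1.5]  # decoder scores for one time step
allowed = [0, 1, 3, 4]                   # hypothetical image-grounded vocabulary

unconstrained_id = max(range(len(logits)), key=logits.__getitem__)
constrained_id = greedy_step(logits, allowed)
probs = constrained_softmax(logits, allowed)
```

In this toy step the unconstrained decoder would choose the highest-scoring word overall, while the constrained decoder can only choose among the allowed words, so a word that scores well but is absent from the image-grounded vocabulary is never emitted.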