Visual attention based on long-short term memory model for image caption generation

Abstract

Image caption generation becomes a raising topic in computer vision and artificial intelligence. In order to solve the problem of stiff description, we intend to extract richer features using convolutional neural network (CNN). A neural and probabilistic framework has been proposed consequently which combines CNN with a special form of recurrent neural network (RNN) to produce an end-to-end image captioning. We use a model that takes advantage of word to vector to encode the variable length input into a fixed dimensional vector. Considering the description of the object in an image is not specific enough, we introduce an attention mechanism through visualization to show how the model is able to fix its gaze on salient objects. We validate our model on three benchmark datasets and get great performance by using standard evaluation metrics.

References

Page 1

	Year	Citations

Page 1