Summarization-based Video Caption via Deep Neural Networks

Abstract

Generating appropriate descriptions for visual content draws increasing attention recently, where the promising progresses were obtained owing to the breakthroughs in deep neural networks. Different from the traditional SVO (subject, verb, object) based methods, in this paper, we propose a novel framework of video caption via deep neural networks. For each frame, we extract visual features by a fine-tuned deep Convulutional Neural Networks (CNN), which are then fed into a Recurrent Neural Networks (RNN) to generate novel sentences descriptions for each frame. In order to obtain the most representative and high-quality descriptions for target video, a well-devised automatic summarization process is incorporated to reduce the noises by ranking on the sentence-sequence graph. Moreover, our framework owns the merit of describing out-of-sample videos by transferring knowledge from pre-captioned images. Experiments on the benchmark datasets demonstrate our method has better performance than the state-of-the-art methods of video caption in language generation metrics as well as SVO accuracy.

References

Page 1

	Year	Citations

Page 1