Publication | Closed Access

Stacked Multimodal Attention Network for Context-Aware Video Captioning

Citations: 41
References: 47
Year: 2021

Abstract

Recent neural models for video captioning usually employ an attention-based encoder-decoder framework. However, current approaches mainly attend to the motion and object features of the video when generating the caption, ignoring potentially useful historical information. In addition, exposure bias and vanishing-gradient problems persist in current caption-generation models. In this paper, we propose a novel video captioning framework, named the Stacked Multimodal Attention Network (SMAN). It adopts additional visual and textual historical information as context features during caption generation, employs a stacked architecture to process the different features gradually, and uses reinforcement learning together with a coarse-to-fine training strategy to further improve the generated results. Both quantitative and qualitative experiments on the benchmark datasets MSVD and MSR-VTT show the effectiveness and feasibility of our framework. The code is available at https://github.com/zhengyi123456/SMAN.
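To make the stacked idea concrete: the abstract describes attending to several feature modalities (motion, object, and historical context) one after another rather than all at once. The following is a minimal illustrative sketch of that pattern, not the authors' implementation — the function names, the dot-product attention, and the residual fusion step are all assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    # Dot-product attention: weight each feature vector (rows of
    # `features`, shape (n, d)) by its similarity to `query` (shape (d,)).
    scores = features @ query          # (n,)
    weights = softmax(scores)          # (n,) attention distribution
    return weights @ features          # (d,) attended summary

def stacked_multimodal_step(hidden, motion, objects, context):
    # One decoding step in the spirit of a stacked multimodal attention:
    # attend to each modality in turn, refining the query with each
    # attended summary via a residual add (hypothetical fusion choice).
    q = hidden
    for feats in (motion, objects, context):   # processed gradually, in sequence
        q = q + attend(q, feats)
    return q
```

In a real decoder the refined vector `q` would feed the word-prediction layer at each time step; here the stacking simply means each modality's attention is conditioned on the summaries already gathered from the previous ones.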

