Publication | Closed Access

Stacked Multimodal Attention Network for Context-Aware Video Captioning

Citations: 41
References: 47
Year: 2021

Abstract

Recent neural models for video captioning usually employ an attention-based encoder-decoder framework. However, current approaches mainly attend to the motion and object features of the video when generating the caption, ignoring potentially useful historical information. In addition, exposure bias and vanishing-gradient problems persist in current caption-generation models. In this paper, we propose a novel video captioning framework, named the Stacked Multimodal Attention Network (SMAN). It adopts additional visual and textual historical information as context features during caption generation, employs a stacked architecture to process the different features gradually, and uses reinforcement learning together with a coarse-to-fine training strategy to further improve the generated results. Both quantitative and qualitative experiments on the benchmark datasets MSVD and MSR-VTT show the effectiveness and feasibility of our framework. The code is available at https://github.com/zhengyi123456/SMAN.
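To make the stacked idea concrete: the abstract describes attending to several feature modalities (motion, object, and historical context) one after another rather than all at once. The following is a minimal illustrative sketch of that pattern, not the authors' implementation — the function names, the dot-product attention, and the residual fusion step are all assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    # Dot-product attention: weight each feature vector (rows of
    # `features`, shape (n, d)) by its similarity to `query` (shape (d,)).
    scores = features @ query          # (n,)
    weights = softmax(scores)          # (n,) attention distribution
    return weights @ features          # (d,) attended summary

def stacked_multimodal_step(hidden, motion, objects, context):
    # One decoding step in the spirit of a stacked multimodal attention:
    # attend to each modality in turn, refining the query with each
    # attended summary via a residual add (hypothetical fusion choice).
    q = hidden
    for feats in (motion, objects, context):   # processed gradually, in sequence
        q = q + attend(q, feats)
    return q
```

In a real decoder the refined vector `q` would feed the word-prediction layer at each time step; here the stacking simply means each modality's attention is conditioned on the summaries already gathered from the previous ones.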

