Publication | Closed Access
Describing Videos using Multi-modal Fusion
Citations: 115
References: 19
Year: 2016
Venue: unknown
Engineering, Machine Learning, Video Summarization, Video Retrieval, Speech Recognition, Natural Language Processing, Multimodal LLM, Image Analysis, Pattern Recognition, Computational Linguistics, Meta Modalities, Multi-modal Fusion, Machine Translation, Vision Language Model, Multimodal Signal Processing, Video Understanding, Deep Learning, MSR Video, Computer Vision, Multi-modal Summarization
Describing videos with natural language is one of the ultimate goals of video understanding. Video records multi-modal information, including image, motion, aural, and speech content, among others. The MSR Video to Language Challenge provides a good opportunity to study multi-modal fusion in the captioning task. In this paper, we propose a multi-modal fusion encoder and integrate it with a text sequence decoder into an end-to-end video captioning framework. Features from the visual, aural, speech, and meta modalities are fused to represent the video content. Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) are then used as the decoder to generate natural language sentences. Experimental results show the effectiveness of the multi-modal fusion encoder trained in the end-to-end framework, which achieved top performance in both common-metric evaluation and human evaluation.
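The encoder-decoder pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion operator, the feature dimensions, or how the decoder is conditioned on the video, so everything below is an assumption made for illustration: fusion is taken to be concatenation followed by one linear layer with tanh, the fused vector initializes the LSTM hidden state, and decoding is greedy. All names (`fuse`, `lstm_step`, `greedy_decode`, the toy vocabulary) are hypothetical, and the weights are random and untrained, so the resulting caption is arbitrary.

```python
import math
import random


def fuse(features, weights, bias):
    """Assumed fusion: concatenate modality vectors (sorted by name), then linear + tanh."""
    concat = [x for name in sorted(features) for x in features[name]]
    return [math.tanh(sum(w * x for w, x in zip(row, concat)) + b)
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, W, b):
    """One standard LSTM step; W stacks the input, forget, output and candidate gates."""
    n = len(h)
    xh = x + h  # list concatenation of input and previous hidden state
    z = [sum(w * v for w, v in zip(row, xh)) + bi for row, bi in zip(W, b)]
    i = [sigmoid(v) for v in z[0:n]]
    f = [sigmoid(v) for v in z[n:2 * n]]
    o = [sigmoid(v) for v in z[2 * n:3 * n]]
    g = [math.tanh(v) for v in z[3 * n:4 * n]]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def greedy_decode(video_vec, params, vocab, max_len=10):
    """Generate words one at a time, always picking the highest-scoring token."""
    h = list(video_vec)          # assumption: fused vector initializes the hidden state
    c = [0.0] * len(h)
    token = vocab.index("<bos>")
    words = []
    for _ in range(max_len):
        h, c = lstm_step(params["embed"][token], h, c, params["W"], params["b"])
        logits = [sum(w * v for w, v in zip(row, h)) for row in params["out"]]
        token = max(range(len(vocab)), key=logits.__getitem__)
        if vocab[token] == "<eos>":
            break
        words.append(vocab[token])
    return words


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Toy per-modality features; real systems would use pretrained extractors.
features = {"visual": [0.2, 0.5, 0.1], "aural": [0.7, 0.3],
            "speech": [0.0, 1.0], "meta": [1.0]}
in_dim = sum(len(v) for v in features.values())
hidden, embed_dim = 4, 3
vocab = ["<bos>", "<eos>", "a", "man", "is", "singing"]

fusion_W = rand_matrix(hidden, in_dim, rng)
fusion_b = [0.0] * hidden
params = {
    "embed": rand_matrix(len(vocab), embed_dim, rng),
    "W": rand_matrix(4 * hidden, embed_dim + hidden, rng),
    "b": [0.0] * (4 * hidden),
    "out": rand_matrix(len(vocab), hidden, rng),
}

video_vec = fuse(features, fusion_W, fusion_b)
caption = greedy_decode(video_vec, params, vocab)
print(video_vec, caption)
```

In an end-to-end setup such as the one the abstract describes, the fusion and decoder parameters would be trained jointly on video-caption pairs; this sketch only shows the forward pass.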