Publication | Closed Access
Multimodal Saliency and Fusion for Movie Summarization Based on Aural, Visual, and Textual Attention
Citations: 264
References: 58
Year: 2013
Movie Summarization, Engineering, Entity Summarization, Narrative Summarization, Video Summarization, Attention, Video Retrieval, Automatic Summarization, Speech Recognition, Image Analysis, Multimodal Streams, Visual Saliency, Produced Summaries, Multimodal Signal Processing, Video Understanding, Computer Vision, Multi-modal Summarization, Eye Tracking, Textual Attention, Multimodal Saliency
Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher-level cognitive processes. Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream. Aural or auditory saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color, and orientation. Textual or linguistic saliency is extracted from part-of-speech tagging on the subtitle information available with most movie distributions. The individual saliency streams, obtained from modality-dependent cues, are integrated in a multimodal saliency curve, modeling the time-varying perceptual importance of the composite video stream and signifying prevailing sensory events. The multimodal saliency representation forms the basis of a generic, bottom-up video summarization algorithm. Different fusion schemes are evaluated on a movie database of multimodal saliency annotations with comparative results provided across modalities. The produced summaries, based on low-level features and content-independent fusion and selection, are of subjectively high aesthetic and informative quality.
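The fusion-then-selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not state which fusion schemes or which selection rule the paper uses, so everything here is an assumption made for illustration: min-max normalization of each stream, a weighted-average fusion, a keep-the-top-fraction selection rule, and all function and parameter names (`min_max_normalize`, `fuse_linear`, `select_frames`, `keep_ratio`). The per-frame saliency values are made-up toy numbers, not data from the paper.

```python
from typing import Dict, List, Sequence


def min_max_normalize(curve: Sequence[float]) -> List[float]:
    """Rescale a saliency curve to [0, 1]; a flat curve maps to all zeros."""
    lo, hi = min(curve), max(curve)
    if hi == lo:
        return [0.0 for _ in curve]
    return [(v - lo) / (hi - lo) for v in curve]


def fuse_linear(streams: Dict[str, Sequence[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted average of normalized per-modality curves (assumed scheme)."""
    lengths = {len(curve) for curve in streams.values()}
    if len(lengths) != 1:
        raise ValueError("all saliency streams must have the same length")
    n = lengths.pop()
    normalized = {name: min_max_normalize(c) for name, c in streams.items()}
    total = sum(weights[name] for name in streams)
    return [
        sum(weights[name] * normalized[name][t] for name in streams) / total
        for t in range(n)
    ]


def select_frames(fused: Sequence[float], keep_ratio: float) -> List[int]:
    """Indices of the most salient frames, returned in temporal order."""
    k = max(1, round(keep_ratio * len(fused)))
    ranked = sorted(range(len(fused)), key=lambda t: fused[t], reverse=True)
    return sorted(ranked[:k])


# Toy per-frame saliency values for the three modalities (hypothetical).
streams = {
    "aural":   [0.1, 0.9, 0.4, 0.2],
    "visual":  [0.2, 0.8, 0.7, 0.1],
    "textual": [0.0, 1.0, 0.0, 0.0],
}
weights = {"aural": 1.0, "visual": 1.0, "textual": 1.0}  # equal weights, assumed

fused = fuse_linear(streams, weights)
summary = select_frames(fused, keep_ratio=0.5)
print(summary)  # [1, 2]
```

Normalizing each stream before fusing keeps one modality from dominating purely because of its numeric range. Swapping `fuse_linear` for a per-frame `min` or `max` over the normalized streams would give other simple fusion rules that could be compared in the same way.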