Publication | Open Access
MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos
Citations: 344 | References: 32 | Year: 2016
Engineering, Multimedia Analysis, Communication, Multimodal Sentiment Analysis, Sentiment Analysis, Corpus Linguistics, Text Mining, Natural Language Processing, Social Media, Data Science, Computational Linguistics, Affective Computing, Online Video, Language Studies, Content Analysis, Subjectivity Analysis, Multimodal Signal Processing, Deep Learning, Online Opinion Videos, Multi-modal Summarization, Sentiment Intensity, Linguistics, Emotion Recognition, Opinion Aggregation
Online video platforms host user-generated opinions daily, yet sentiment and subjectivity analysis for such multimedia content remains underexplored for lack of datasets, methods, baselines, and multimodal statistical analysis. The study introduces MOSI, the first opinion-level annotated corpus for sentiment and subjectivity in online videos. MOSI is annotated with subjectivity labels, sentiment intensity, per-frame and per-opinion visual features, and per-millisecond audio features; the authors also provide baseline models and a multimodal fusion approach that jointly models spoken words and visual gestures.
People share their opinions, stories, and reviews through online video-sharing websites every day. The study of sentiment and subjectivity in these opinion videos is attracting growing attention from academia and industry. While sentiment analysis has been successful for text, it remains an understudied problem for videos and multimedia content. The biggest obstacles for studies in this direction are the lack of a proper dataset, methodology, baselines, and statistical analysis of how information from different modalities relates to each other. This paper introduces to the scientific community the first opinion-level annotated corpus for sentiment and subjectivity analysis in online videos, called the Multimodal Opinion-level Sentiment Intensity (MOSI) dataset. The dataset is rigorously annotated with labels for subjectivity, sentiment intensity, per-frame and per-opinion visual features, and per-millisecond audio features. Furthermore, we present baselines for future studies in this direction as well as a new multimodal fusion approach that jointly models spoken words and visual gestures.
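The abstract describes the fusion approach only at a high level. As a minimal sketch of one common baseline setup, not the paper's actual model, per-opinion features from each modality can be concatenated (early fusion) and fed to a regressor that predicts sentiment intensity on the [-3, 3] scale. The feature dimensions and the synthetic data below are illustrative assumptions.

```python
# Early-fusion sketch (illustrative, NOT the authors' method): concatenate
# per-opinion text, audio, and visual feature vectors and fit a linear
# regressor for sentiment intensity in [-3, 3].
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n_opinions = 200
text_feats = rng.normal(size=(n_opinions, 300))   # e.g., averaged word embeddings (assumed dim)
audio_feats = rng.normal(size=(n_opinions, 74))   # e.g., averaged acoustic descriptors (assumed dim)
visual_feats = rng.normal(size=(n_opinions, 47))  # e.g., averaged facial-gesture features (assumed dim)
labels = rng.uniform(-3, 3, size=n_opinions)      # synthetic sentiment-intensity labels

# Early fusion: one feature vector per opinion segment.
fused = np.concatenate([text_feats, audio_feats, visual_feats], axis=1)

X_train, X_test, y_train, y_test = train_test_split(fused, labels, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```

With real MOSI features in place of the synthetic arrays, the same concatenate-then-regress pipeline serves as a simple multimodal baseline against which joint models of spoken words and gestures can be compared.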