Emotion spotting: discovering regions of evidence in audio-visual emotion expressions

Abstract

Research has demonstrated that humans require different amounts of information, over time, to accurately perceive emotion expressions, and that the amount required varies by emotion class. For example, recognition of happiness requires a longer stimulus than recognition of anger. However, previous automatic emotion recognition systems have often overlooked these differences. In this work, we propose a data-driven framework to explore patterns (timings and durations) of emotion evidence specific to individual emotion classes. Further, we demonstrate that these patterns vary as a function of which modality (lower face, upper face, or speech) is examined, and that consistent patterns emerge across different folds of experiments. We also show similar patterns across emotional corpora (IEMOCAP and MSP-IMPROV). In addition, we show that our proposed method, which uses only a portion of the data (59% for IEMOCAP), achieves accuracy comparable to that of a system that uses all of the data within each utterance. Our method also achieves higher accuracy than a baseline method that randomly chooses a portion of the data. We show that the performance gain of the method comes mostly from prototypical emotion expressions (defined as expressions with rater consensus). The innovation in this study comes from its understanding of how multimodal cues reveal emotion over time.
