Say Cheese vs. Smile

Abstract

Facial movement is modulated by both emotion and speech articulation. Facial emotion recognition systems aim to discriminate between emotions while reducing the speech-related variability in facial cues. This aim is often achieved using two key features: (1) phoneme segmentation, in which facial cues are temporally divided into units that each contain a single phoneme, and (2) phoneme-specific classification, in which systems learn patterns associated with groups of visually similar phonemes (visemes), e.g., P, B, and M. In this work, we empirically compare the effects of different temporal segmentation and classification schemes for facial emotion recognition. We propose an unsupervised segmentation method that does not require costly phonetic transcripts. We show that the proposed method bridges the accuracy gap between a traditional sliding window method and phoneme segmentation, achieving a statistically significant performance gain. We also demonstrate that the segments produced by the proposed unsupervised strategy are similar to those produced by phoneme segmentation. This paper provides new insight into unsupervised facial motion segmentation and the impact of speech variability on emotion classification.
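
To make the two reference segmentation schemes concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes that facial cues are stored as a sequence of frames and that a phonetic transcript is available as (phoneme, start, end) tuples in seconds. All names (sliding_window_segments, phoneme_segments, VISEME_GROUP, the "PBM" label) are hypothetical, and the viseme table contains only the P/B/M group named above. The proposed unsupervised method is not sketched because the abstract does not specify it.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # half-open frame range [start, end)

# Hypothetical viseme table: only the P/B/M group is named in the text.
VISEME_GROUP: Dict[str, str] = {"P": "PBM", "B": "PBM", "M": "PBM"}


def sliding_window_segments(n_frames: int, window: int, step: int) -> List[Segment]:
    """Fixed-length windows that ignore phoneme boundaries."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if n_frames <= 0:
        return []
    # Last window start that still fits; a short sequence yields one clipped window.
    last_start = max(n_frames - window, 0)
    return [(s, min(s + window, n_frames)) for s in range(0, last_start + 1, step)]


def phoneme_segments(
    transcript: List[Tuple[str, float, float]], fps: float
) -> List[Tuple[str, Segment]]:
    """One segment per transcribed phoneme, labelled with its viseme group."""
    segments = []
    for phoneme, start_s, end_s in transcript:
        # Phonemes without a known group keep their own label (assumption).
        label = VISEME_GROUP.get(phoneme, phoneme)
        segments.append((label, (round(start_s * fps), round(end_s * fps))))
    return segments


if __name__ == "__main__":
    print(sliding_window_segments(n_frames=10, window=4, step=2))
    print(phoneme_segments([("P", 0.0, 0.1), ("AA", 0.1, 0.3)], fps=100))
```

In this sketch the sliding window needs no transcript but cuts across phoneme boundaries, whereas the phoneme-based variant aligns segments with articulation at the cost of requiring transcribed timings. That trade-off is what motivates a transcript-free alternative.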
