Analysis of emotion recognition using facial expressions, speech and multimodal information

TLDR

Emotion recognition from non-verbal cues is essential for natural human-computer interaction, yet most studies rely on either facial expressions or speech alone, and comparatively little work fuses the two modalities to improve accuracy and robustness. This study evaluates the strengths and limitations of facial-expression-only and acoustic-only emotion recognition systems and compares decision-level and feature-level fusion strategies. The authors used a motion-capture database of an actress performing sadness, anger, happiness, and a neutral state, recording detailed facial motions and simultaneous speech to evaluate emotion classification. Facial-expression-based recognition outperformed the acoustic-only system, and fusing the facial and speech modalities at either the decision or the feature level measurably improved both accuracy and robustness.

Abstract

The interaction between human beings and computers will be more natural if computers are able to perceive and respond to human non-verbal communication such as emotions. Although several approaches have been proposed to recognize human emotions based on facial expressions or speech, relatively limited work has been done to fuse these two, and other, modalities to improve the accuracy and robustness of the emotion recognition system. This paper analyzes the strengths and the limitations of systems based only on facial expressions or acoustic information. It also discusses two approaches used to fuse these two modalities: decision-level and feature-level integration. Using a database recorded from an actress, four emotions were classified: sadness, anger, happiness, and neutral state. Markers placed on her face allowed detailed facial motions to be captured with motion capture, in conjunction with simultaneous speech recordings. The results reveal that the system based on facial expressions performed better than the system based on acoustic information alone for the emotions considered. Results also show the complementarity of the two modalities: when they are fused, the performance and the robustness of the emotion recognition system improve measurably.
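
The difference between the two fusion approaches named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the classifier, the features, or the rule used to combine decisions, so everything below is an assumption made for illustration: a nearest-centroid classifier stands in for the real one, score averaging stands in for the combining rule, and the feature vectors are synthetic. Feature-level fusion concatenates the facial and acoustic features and trains one classifier; decision-level fusion trains one classifier per modality and merges their class scores.

```python
import numpy as np

# The four classes named in the abstract.
EMOTIONS = ["sadness", "anger", "happiness", "neutral"]


def make_toy_data(seed, n_per_class=20, face_dim=6, speech_dim=4):
    """Synthetic stand-in features (hypothetical, not the paper's data)."""
    rng = np.random.default_rng(seed)
    y = np.repeat(np.arange(len(EMOTIONS)), n_per_class)
    # Class c is centred at 4*c in every dimension; speech is made noisier.
    face = 4.0 * y[:, None] + rng.normal(size=(y.size, face_dim))
    speech = 4.0 * y[:, None] + rng.normal(scale=2.0, size=(y.size, speech_dim))
    return face, speech, y


def fit_centroids(X, y, n_classes):
    """Stand-in classifier: one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def class_scores(X, centroids):
    """Softmax over negative squared distances; each row sums to 1."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def feature_level_fusion(face_tr, speech_tr, y_tr, face_te, speech_te):
    """Concatenate both modalities, then train a single classifier."""
    n = len(EMOTIONS)
    X_tr = np.hstack([face_tr, speech_tr])
    X_te = np.hstack([face_te, speech_te])
    return class_scores(X_te, fit_centroids(X_tr, y_tr, n)).argmax(axis=1)


def decision_level_fusion(face_tr, speech_tr, y_tr, face_te, speech_te):
    """Train one classifier per modality, then average their class scores."""
    n = len(EMOTIONS)
    p_face = class_scores(face_te, fit_centroids(face_tr, y_tr, n))
    p_speech = class_scores(speech_te, fit_centroids(speech_tr, y_tr, n))
    return ((p_face + p_speech) / 2.0).argmax(axis=1)


if __name__ == "__main__":
    face_tr, speech_tr, y_tr = make_toy_data(seed=0)
    face_te, speech_te, y_te = make_toy_data(seed=1)
    for fuse in (feature_level_fusion, decision_level_fusion):
        pred = fuse(face_tr, speech_tr, y_tr, face_te, speech_te)
        print(fuse.__name__, "accuracy:", (pred == y_te).mean())
```

Averaging is only one possible combining rule; the point of the sketch is where the modalities meet (before training in feature-level fusion, after scoring in decision-level fusion), not the specific numbers it prints on toy data.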
