Multi-modal Emotion Recognition Based on Deep Learning in Speech, Video and Text

Abstract

Emotions are a core part of human communication, and research on emotion recognition has grown steadily. Multi-modal emotion recognition has recently attracted particular attention, with a large body of work on recognizing emotion from speech, video, text and physiological signals. By fusing information from different modalities, a multi-modal system lets the modalities complement one another, which improves the final recognition rate. This paper preprocesses the speech, video and text modalities of the IEMOCAP dataset, uses deep neural networks to extract emotional features, and fuses the information at the feature level. Five emotion classes are considered: angry, excited, sad, neutral and happy. The three-modality emotion recognition model reaches an accuracy of 0.9541 on the training set and 0.68383 on the validation set. Compared with speech-only emotion recognition, this is an improvement of 0.11751.
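Feature-level fusion, as named in the abstract, can be illustrated with a minimal sketch. The abstract does not give the network architecture, so everything below is an assumption made for illustration: the feature dimensions, the random weights, the single linear layer with softmax, and the helper names `fuse_features` and `classify` are hypothetical and are not the paper's model. The sketch only shows the general idea of concatenating per-modality feature vectors before classifying into the five emotion classes.

```python
import math
import random

# The five emotion classes listed in the abstract.
EMOTIONS = ["angry", "excited", "sad", "neutral", "happy"]


def fuse_features(speech, video, text):
    """Feature-level fusion: concatenate the per-modality feature vectors."""
    return list(speech) + list(video) + list(text)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Hypothetical classifier head: one linear layer followed by softmax."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Placeholder features standing in for the outputs of the per-modality
# deep networks; the dimensions (4, 3, 5) are arbitrary assumptions.
rng = random.Random(0)
speech_feat = [rng.uniform(-1, 1) for _ in range(4)]
video_feat = [rng.uniform(-1, 1) for _ in range(3)]
text_feat = [rng.uniform(-1, 1) for _ in range(5)]

fused = fuse_features(speech_feat, video_feat, text_feat)

# Untrained random weights, used only so the sketch runs end to end.
weights = [[rng.uniform(-0.5, 0.5) for _ in fused] for _ in EMOTIONS]
biases = [0.0] * len(EMOTIONS)

probs = classify(fused, weights, biases)
predicted = EMOTIONS[probs.index(max(probs))]
print(len(fused), predicted)
```

Because the classifier sees a single concatenated vector, it can learn interactions across modalities, which is the usual motivation for fusing at the feature level instead of combining separate per-modality decisions.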
