Publication | Closed Access
Multi-Lingual Multi-Task Speech Emotion Recognition Using wav2vec 2.0
Citations: 50
References: 23
Year: 2022
Engineering, Machine Learning, Speech Corpus, Multilingualism, Spoken Language Processing, Multilingual Pretraining, Speech Recognition, Natural Language Processing, Data Science, Affective Computing, Voice Recognition, Health Sciences, Speech Emotion Recognition, Deep Learning, Speech Communication, Speech Analysis, Voice Assistants, Voice, Multi-speaker Speech Recognition, SER System, Speech Processing, Speech Input, Speech Perception, Linguistics, Emotion Recognition
Speech Emotion Recognition (SER) has several use cases for Digital Entertainment Content (DEC) in Over-the-top (OTT) services, emotive Text-to-Speech (TTS) engines, and voice assistants. In this work, we present a Multi-Lingual (MLi) and Multi-Task Learning (MTL) audio-only SER system based on the multi-lingual pre-trained wav2vec 2.0 model. The model is fine-tuned on 25 open-source datasets in 13 locales across 7 emotion categories. We show that: a) our single-task wav2vec 2.0 model outperforms the single-task model based on Pre-trained Audio Neural Networks (PANN) by 7.2% (relative); b) the best MTL model outperforms the PANN-based and wav2vec 2.0 based single-task models by 8.6% and 1.7% (relative), respectively; c) the MTL-based system outperforms the pre-trained single-task wav2vec 2.0 model in 9 out of 13 locales in terms of weighted F1 score; and d) the MTL-MLi wav2vec 2.0 model outperforms the state of the art for the languages contained in the pre-training corpora.
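The abstract reports results as weighted F1 scores and relative improvements. The sketch below illustrates how these two quantities are commonly computed. It is not taken from the paper: the function names, labels, and scores are hypothetical, and it assumes the usual definitions (per-class F1 averaged with weights proportional to class support, and improvement expressed as a percentage of the baseline score).

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 (assumed standard definition)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its support
    return score


def relative_improvement(new, baseline):
    """Improvement of `new` over `baseline`, as a percentage of `baseline`."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical labels, for illustration only (not data from the paper).
y_true = ["happy", "happy", "sad", "angry"]
y_pred = ["happy", "sad", "sad", "angry"]
print(weighted_f1(y_true, y_pred))

# Hypothetical scores: a model at 0.55 against a baseline at 0.50.
print(relative_improvement(0.55, 0.50))
```

Weighting by support means that frequent emotion classes contribute more to the score than rare ones, which matters when locale datasets are imbalanced across emotion categories.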