
Abstract

In this manuscript, the topic of multi-corpus Speech Emotion Recognition (SER) is approached from a deep transfer learning perspective. A large corpus of emotional speech data, <b><small>EmoSet</small></b>, is assembled from a number of existing SER corpora. In total, <small>EmoSet</small> contains <b>84 181 audio recordings</b> from <b>26 SER corpora</b> with a total duration of over <b>65 hours</b>. The corpus is then utilised to create a novel framework for multi-corpus SER and general audio recognition, namely <b><small>EmoNet</small></b>. A combination of a deep ResNet architecture and residual adapters is transferred from the field of multi-domain visual recognition to multi-corpus SER on <small>EmoSet</small>. The introduced residual adapter approach enables parameter-efficient training of a multi-domain SER model on all 26 corpora. A shared model with only 3.5 times the number of parameters of a model trained on a single database leads to increased performance for 21 of the 26 corpora in <small>EmoSet</small>. Using repeated training runs and Almost Stochastic Order with a significance level of <inline-formula><tex-math notation="LaTeX">$\alpha = 0.05$</tex-math></inline-formula>, these improvements are further significant for 15 datasets, while just three corpora see only significant decreases across the residual adapter transfer experiments. Finally, we make our <small>EmoNet</small> framework publicly available for users and developers at <monospace><uri>https://github.com/EIHW/EmoNet</uri></monospace>.
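
The parameter-efficiency claim can be illustrated with a rough per-layer count. The sketch below is not taken from the EmoNet code base. It assumes, purely for illustration, that each layer consists of one 3x3 convolution shared across all corpora plus one 1x1 convolutional adapter per corpus, and it ignores biases, normalisation layers, and classifier heads. The function names and the channel width of 64 are hypothetical.

```python
def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a 2D convolution without bias."""
    return c_in * c_out * kernel * kernel


def layer_param_counts(channels: int, n_corpora: int) -> dict:
    """Per-layer weight counts under the assumed layout:
    one shared 3x3 convolution plus one 1x1 adapter per corpus."""
    shared = conv_params(channels, channels, 3)
    adapter = conv_params(channels, channels, 1)
    single = shared + adapter              # model for one corpus
    multi = shared + n_corpora * adapter   # one shared model for all corpora
    separate = n_corpora * single          # one independent model per corpus
    return {
        "single": single,
        "multi": multi,
        "separate": separate,
        "multi_vs_single": multi / single,
        "separate_vs_single": separate / single,
    }


counts = layer_param_counts(channels=64, n_corpora=26)
print(counts["multi_vs_single"])      # shared model relative to one single-corpus model
print(counts["separate_vs_single"])   # 26 independent models relative to one
```

Under these simplifying assumptions the shared model costs (9 + 26) / (9 + 1) = 3.5 times the weights of a single-corpus model, whereas training 26 independent models costs 26 times as much. This agrees with the factor quoted in the abstract, though the exact accounting used by the authors may differ.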
