Publication | Open Access
Learning spectro-temporal features with 3D CNNs for speech emotion recognition
Citations: 37
References: 24
Year: 2017
Unknown Venue
Convolutional Neural Network, Engineering, Machine Learning, Speech Recognition, Natural Language Processing, Data Science, Affective Computing, Robust Speech Recognition, Voice Recognition, Health Sciences, Speech Emotion Recognition, Deep Learning, Speech Analysis, Speech Communication, Multi-speaker Speech Recognition, Speech Processing, Speech Input, Speech Perception, Deep Spectral Kernels, Emotion Recognition
In this paper, we propose deep 3-dimensional convolutional networks (3D CNNs) to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a convolutional neural network and long short-term memory (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed method and other state-of-the-art methods in a speaker-independent manner using aggregated corpora that provide a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels of a homogeneous architecture are optimal for the task; and 2) our 3D CNNs are more effective for spectro-temporal feature learning than the other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and observed distinct clusters of emotions.
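To make the idea of a 3D convolution over spectro-temporal input concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The axis layout (segments, frames within a segment, frequency bins), the kernel shape and the channel counts are all assumptions chosen for illustration, since the abstract does not state them. The function names `conv3d_valid` and `conv3d_param_count` are hypothetical.

```python
from itertools import product


def conv3d_valid(volume, kernel):
    """Naive single-channel 3D cross-correlation ('valid' padding, stride 1).

    volume: nested lists indexed [segment][frame][freq_bin] (assumed layout)
    kernel: nested lists indexed [k_segment][k_frame][k_freq]
    """
    T, F, S = len(volume), len(volume[0]), len(volume[0][0])
    kT, kF, kS = len(kernel), len(kernel[0]), len(kernel[0][0])
    oT, oF, oS = T - kT + 1, F - kF + 1, S - kS + 1
    out = [[[0.0] * oS for _ in range(oF)] for _ in range(oT)]
    for t, f, s in product(range(oT), range(oF), range(oS)):
        acc = 0.0
        # One kernel position covers all three axes at once, so each output
        # value mixes information across segments, frames and frequency bins.
        for dt, df, ds in product(range(kT), range(kF), range(kS)):
            acc += volume[t + dt][f + df][s + ds] * kernel[dt][df][ds]
        out[t][f][s] = acc
    return out


def conv3d_param_count(in_channels, out_channels, kernel_shape):
    """Weights plus one bias per output channel for a single 3D conv layer."""
    kT, kF, kS = kernel_shape
    return out_channels * (in_channels * kT * kF * kS + 1)


# Illustrative input: 4 segments x 6 frames x 8 frequency bins, all ones.
volume = [[[1.0] * 8 for _ in range(6)] for _ in range(4)]
# Illustrative kernel: small along the segment axis, 3 x 3 elsewhere.
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]

features = conv3d_valid(volume, kernel)
print(len(features), len(features[0]), len(features[0][0]))
print(conv3d_param_count(in_channels=1, out_channels=16, kernel_shape=(2, 3, 3)))
```

Because a single kernel spans all three axes, one layer can respond to patterns that unfold both within and across segments, which is the property the abstract contrasts with a CNN-LSTM hybrid. The parameter count grows with the product of the kernel dimensions, so keeping one axis small helps keep the model moderate in size.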