Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching

TLDR

Speech emotion recognition is difficult because of the large affective gap between subjective emotions and low-level acoustic features. Because deep convolutional neural networks (DCNNs) have bridged a comparable semantic gap in visual tasks, this study investigates whether a DCNN can do the same for the affective gap in speech signals. The authors convert speech into three log-Mel-spectrogram channels (static, delta, delta-delta) as an RGB-like input, feed them into an AlexNet pretrained on ImageNet to extract segment-level features, aggregate these with discriminant temporal pyramid matching (DTPM), which combines temporal pyramid matching with optimal Lp-norm pooling, and classify the resulting utterance-level representation with a linear SVM. Experiments on EMO-DB, RML, eNTERFACE05, and BAUM-1s show that the pretrained DCNN plus DTPM achieves promising recognition performance, and that fine-tuning on the target datasets improves accuracy further.

Abstract

Speech emotion recognition is challenging because of the affective gap between subjective emotions and low-level features. By integrating multilevel feature learning and model training, deep convolutional neural networks (DCNNs) have exhibited remarkable success in bridging the semantic gap in visual tasks such as image classification and object detection. This paper explores how to utilize a DCNN to bridge the affective gap in speech signals. To this end, we first extract three channels of log Mel-spectrograms (static, delta, and delta-delta), analogous to the red, green, blue (RGB) representation of an image, as the DCNN input. Then, the AlexNet DCNN model pretrained on the large ImageNet dataset is employed to learn high-level feature representations on each segment divided from an utterance. The learned segment-level features are aggregated by a discriminant temporal pyramid matching (DTPM) strategy. DTPM combines temporal pyramid matching and optimal Lp-norm pooling to form a global utterance-level feature representation, which is then classified by a linear support vector machine. Experimental results on four public datasets, namely EMO-DB, RML, eNTERFACE05, and BAUM-1s, show the promising performance of our DCNN model and the DTPM strategy. Another interesting finding is that the DCNN model pretrained for image applications performs reasonably well in affective speech feature extraction. Further fine-tuning on the target emotional speech datasets substantially improves recognition performance.
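
The two data-shaping steps described above, building the three-channel input and pooling segment-level features over a temporal pyramid, can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the delta window width, the pyramid levels (1, 2, 4), and the fixed value of p are all hypothetical. The abstract does not say how the optimal p is chosen, so p is simply an argument here, and the log Mel-spectrogram and the AlexNet segment features are assumed to be computed elsewhere.

```python
import numpy as np

def delta(feat, width=2):
    """Regression-based delta along the time axis (axis 1), with edge padding.

    feat: array of shape (n_mels, n_frames). `width` is an assumed window size.
    """
    n_frames = feat.shape[1]
    padded = np.pad(feat, ((0, 0), (width, width)), mode="edge")
    num = sum(
        n * (padded[:, width + n: width + n + n_frames]
             - padded[:, width - n: width - n + n_frames])
        for n in range(1, width + 1)
    )
    return num / (2 * sum(n * n for n in range(1, width + 1)))

def three_channel_input(log_mel):
    """Stack static, delta, and delta-delta maps into an RGB-like (3, F, T) array."""
    d1 = delta(log_mel)
    d2 = delta(d1)
    return np.stack([log_mel, d1, d2], axis=0)

def lp_pool(segs, p):
    """Lp-norm pooling over the segment axis: (mean(|x|^p))^(1/p).

    p = 1 averages absolute values; p = inf takes the maximum absolute value.
    """
    a = np.abs(segs)
    if np.isinf(p):
        return a.max(axis=0)
    return np.mean(a ** p, axis=0) ** (1.0 / p)

def temporal_pyramid_pool(segs, levels=(1, 2, 4), p=2.0):
    """Pool (n_segments, dim) features over a temporal pyramid.

    At each level the segment sequence is split into `parts` contiguous chunks,
    each chunk is Lp-pooled, and all pooled vectors are concatenated.
    The levels and p are assumptions, not values taken from the paper.
    """
    pooled = []
    for parts in levels:
        for chunk in np.array_split(segs, parts, axis=0):
            if len(chunk) == 0:
                # Fewer segments than chunks: fall back to a zero vector.
                pooled.append(np.zeros(segs.shape[1]))
            else:
                pooled.append(lp_pool(chunk, p))
    return np.concatenate(pooled)
```

With 8 segment vectors of dimension 5 and levels (1, 2, 4), the utterance-level vector has 5 × 7 = 35 entries. In the pipeline the abstract describes, a vector of this kind would then be passed to a linear SVM.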
