Learning Class-Aligned and Generalized Domain-Invariant Representations for Speech Emotion Recognition

Abstract

Although recent research on speech emotion recognition has demonstrated that learning domain-invariant features provides an elegant solution to domain mismatch, the features learned by existing methods generalize poorly and fail to capture latent information in the datasets. We propose two novel domain adaptation methods, the generalized domain adversarial neural network (GDANN) and the class-aligned GDANN (CGDANN), to learn generalized domain-invariant representations for emotion recognition. GDANN and CGDANN are derived from multitask learning (MTL) and consist of three tasks. The main task is to recognize the emotional category to which the input belongs. The remaining two tasks are auxiliary. The first uses a variational autoencoder to model the input distribution, which encourages the model to learn the distribution of latent representations. The second learns representations common to the different domains, which the domain classifier finds difficult to distinguish. Through a gradient reversal layer, the gradient of the domain classifier guides the shared representations of the source and target domains to approximate each other. To evaluate the effectiveness of the proposed methods, we conduct several experiments on the IEMOCAP and MSP-IMPROV datasets. The results show that the proposed methods perform well compared with state-of-the-art methods. Notably, CGDANN uses a small quantity of labeled target domain samples to align the representation distributions and obtains the best performance among the compared methods. We further visualize the representations learned by the proposed methods and find that the representations of the source and target domains converge with a low variance.
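
The role of the gradient reversal layer can be illustrated with a toy scalar example. The sketch below is not the authors' implementation: the one-weight feature extractor `w`, the one-weight domain classifier `v`, the reversal strength `lam`, and the function name `domain_loss_grads` are all hypothetical choices made for illustration. It only shows the general mechanism, in which the layer acts as the identity in the forward pass and multiplies the gradient by `-lam` in the backward pass, so the domain classifier is trained to separate domains while the shared representation is pushed to confuse it.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss_grads(w, v, x, domain, lam=1.0):
    """Return (loss, grad_w, grad_v) for one sample of a toy scalar model.

    Assumed toy setup (hypothetical, for illustration only):
      feature extractor:  f = w * x
      gradient reversal:  identity forward, gradient scaled by -lam backward
      domain classifier:  p = sigmoid(v * f), binary cross-entropy loss
    `domain` is 0 for the source domain and 1 for the target domain.
    """
    f = w * x                      # shared representation
    p = sigmoid(v * f)             # predicted probability of "target"
    loss = -(domain * math.log(p) + (1 - domain) * math.log(1 - p))

    dloss_dlogit = p - domain      # derivative of the loss w.r.t. v * f
    grad_v = dloss_dlogit * f      # the classifier receives the ordinary gradient
    grad_f = dloss_dlogit * v      # gradient arriving at the reversal layer
    grad_w = (-lam * grad_f) * x   # reversed before reaching the feature extractor
    return loss, grad_w, grad_v


loss, grad_w, grad_v = domain_loss_grads(w=0.5, v=2.0, x=1.0, domain=1)
```

With these inputs, gradient descent on `v` lowers the domain loss, whereas gradient descent on `w` follows the reversed gradient and therefore raises it, which is the adversarial effect the abstract describes. Setting `lam=0` switches the adversarial signal off for the feature extractor.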
