Alleviating ASR Long-Tailed Problem by Decoupling the Learning of Representation and Classification

Abstract

End-to-end (E2E) automatic speech recognition (ASR) has improved substantially in recent years. However, tackling the long-tailed data distribution problem while maintaining an E2E ASR model's performance on high-frequency tokens remains challenging. To address this challenge, we propose a novel decoupled learning method for the sequence-to-sequence ASR architecture. Our method decouples the model's learning procedure into two stages: representation learning and classification learning. In the representation learning stage, we use the encoder output of a pretrained language model as one of the ASR model's learning targets and propose a threshold log cosine embedding loss (TLCE-loss) as the objective function. We also design a frequency-mask cross-entropy loss (FMCE-loss) as an auxiliary loss. In the classification learning stage, we find that introducing a temperature into the softmax function helps reduce the influence of negative samples on tail classes, thus mitigating the biased learning of the classifier. Furthermore, we propose a weighted softmax (w-softmax) that adjusts the ASR posterior probabilities according to token frequency during inference. Additionally, we introduce the tail word/character error rate (TWER/TCER) and the head word/character error rate (HWER/HCER), which evaluate ASR accuracy on tail and head words/characters, respectively. Experimental results on the Switchboard and HKUST corpora show that the proposed method substantially outperforms the baseline, especially in TWER/TCER reduction. To the best of our knowledge, this is the first work to use a decoupled learning method to alleviate the long-tailed problem in sequence-to-sequence ASR.
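
The two classification-stage ideas mentioned above, a softmax temperature and a frequency-dependent adjustment of the posteriors, can be illustrated with a minimal sketch. The abstract does not give the exact form of w-softmax, so the reweighting below (subtracting `alpha * log(count)` from each logit before normalizing) is an assumed, generic frequency-prior correction and not necessarily the authors' definition. The function names, `alpha`, and the toy logits and counts are all hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, stabilized by subtracting the max."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def frequency_weighted_softmax(logits, token_counts, alpha=1.0):
    """Assumed reweighting (not taken from the paper): penalize each logit
    by alpha * log(count), so frequent tokens lose probability mass to
    rare ones. Counts must be strictly positive."""
    adjusted = [x - alpha * math.log(c) for x, c in zip(logits, token_counts)]
    return softmax_with_temperature(adjusted)


# Toy example: token 0 is a head token, token 2 is a tail token.
logits = [2.0, 1.0, 0.5]
counts = [1000, 100, 10]

plain = softmax_with_temperature(logits)
smooth = softmax_with_temperature(logits, temperature=2.0)
reweighted = frequency_weighted_softmax(logits, counts, alpha=0.5)
```

In this toy setting, a temperature above 1 flattens the distribution, and the frequency penalty shifts probability mass from the head token toward the tail token. How the temperature is set during training and how the weights are derived from token frequency in the actual method are not stated in the abstract.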
