Concepedia

Publication | Open Access

Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition

Citations: 13
References: 39
Year: 2024

Abstract

To address the challenges of poor representation capability and low data-utilization rates in end-to-end speech recognition models, this study proposes an end-to-end speech recognition model based on multi-scale feature fusion and multi-view self-supervised learning (MM-ASR), trained under a multi-task learning paradigm. The proposed method emphasizes the importance of inter-layer information within shared encoders, aiming to enhance the model's representation capability via the multi-scale feature fusion module. Moreover, we apply multi-view self-supervised learning to exploit the data more effectively. Our approach is rigorously evaluated on the Aishell-1 dataset, and its effectiveness is further validated on the English WSJ corpus. The experimental results demonstrate a noteworthy 4.6% reduction in character error rate, indicating significantly improved speech recognition performance. These findings showcase the effectiveness and potential of our proposed MM-ASR model for end-to-end speech recognition tasks.
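The abstract describes fusing inter-layer information from a shared encoder. One common form of such multi-scale fusion is a learnable softmax-weighted sum over per-layer hidden states; the sketch below illustrates that idea only, under that assumption (the paper's actual fusion module and the names `fuse_encoder_layers`, `weights` are illustrative, not taken from the source).

```python
import numpy as np

def fuse_encoder_layers(layer_outputs, weights=None):
    """Weighted-sum fusion of per-layer encoder hidden states.

    layer_outputs: list of L arrays, each of shape (T, D) --
                   one hidden-state sequence per encoder layer.
    weights:       optional (L,) array of unnormalized logits;
                   in a trained model these would be learnable.
    Returns a fused (T, D) representation.
    """
    num_layers = len(layer_outputs)
    if weights is None:
        weights = np.zeros(num_layers)  # uniform after softmax
    # Softmax-normalize the layer weights (numerically stable).
    w = np.exp(weights - np.max(weights))
    w = w / w.sum()
    stacked = np.stack(layer_outputs)          # (L, T, D)
    return np.tensordot(w, stacked, axes=1)    # (T, D)
```

With uniform weights this reduces to a plain average over layers; training the weights lets the model emphasize whichever encoder depths carry the most useful features.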
