Publication | Closed Access
Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification
Citations: 91
References: 38
Year: 2022
Keywords: Engineering · Machine Learning · Automatic Speaker Verification · Speech Recognition · Natural Language Processing · Data Science · Speaker Diarization · Robust Speech Recognition · Voice Recognition · Health Sciences · Speech Representations · Ensemble System · Computer Science · Deep Learning · VoxCeleb Dataset · Speech Communication · Multi-speaker Speech Recognition · Speech Processing · Speech Input · Speech Perception · Speaker Recognition
Speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning and have therefore attracted considerable interest for various downstream tasks. In this paper, we explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), using a well-recognized state-of-the-art ASV model, ECAPA-TDNN [1], as the downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into ECAPA-TDNN as input features. Experimental results on the VoxCeleb dataset show that the weighted-average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves equal error rates (EER) of 0.537%, 0.569%, and 1.180% on the three official trials of VoxCeleb1, respectively. An ensemble system combining three pre-trained models further improves the EER to 0.479%, 0.536%, and 1.023%. Among the three evaluation trials, our best system outperforms the winning system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
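The abstract describes feeding a learnable weighted average of all hidden-layer outputs of the pre-trained model into the downstream ASV model in place of FBank features. Below is a minimal PyTorch sketch of that weighted-average step only; the module name `WeightedLayerSum`, the tensor layout, and the layer count are illustrative assumptions, not the authors' released code, and the downstream ECAPA-TDNN itself is not shown.

```python
import torch
import torch.nn as nn


class WeightedLayerSum(nn.Module):
    """Learnable weighted average over the hidden-layer outputs of a
    pre-trained speech model (hypothetical sketch; names and shapes assumed)."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One scalar weight per hidden layer, normalized with a softmax.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, feat_dim)
        w = torch.softmax(self.layer_weights, dim=0)
        # Weighted sum over the layer dimension -> (batch, time, feat_dim)
        return torch.einsum("l,lbtf->btf", w, hidden_states)


if __name__ == "__main__":
    # Toy example: 13 hidden layers, batch of 2, 200 frames, 768-dim features.
    feats = torch.randn(13, 2, 200, 768)
    pooled = WeightedLayerSum(num_layers=13)(feats)
    print(pooled.shape)  # torch.Size([2, 200, 768])
    # `pooled` would then replace FBank features as input to the ASV model.
```

The softmax keeps the layer weights non-negative and summing to one, so the combination stays a convex average while the weights are learned jointly with the downstream model.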