Publication | Closed Access
Injecting Text in Self-Supervised Speech Pretraining
Citations: 25
References: 42
Year: 2021
Keywords: LLM Fine-tuning, Engineering, Spoken Language Processing, Self-supervised Speech Pretraining, Multilingual Pretraining, Speech Recognition, Natural Language Processing, Computational Linguistics, Automated Speech Recognition, Language Studies, Real-time Language, Machine Translation, Transcribed Speech, Speech Synthesis, Speech Output, Deep Learning, Speech Communication, Speech Technology, Self-supervised Pretraining, Speech Processing, Speech Input, Speech Perception, Linguistics
Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text. Lexical learning in the speech encoder is enforced through an additional sequence loss term that is coupled with the contrastive loss during pretraining. We demonstrate that this novel pretraining method yields Word Error Rate (WER) reductions of 10% relative on the well-benchmarked Librispeech task over a state-of-the-art baseline pretrained with wav2vec2.0 only. The proposed method also serves as an effective strategy to compensate for the lack of transcribed speech, effectively matching the performance of 5000 hours of transcribed speech with just 100 hours of transcribed speech on the AMI meeting transcription task. Finally, we demonstrate WER reductions of up to 15% on an in-house Voice Search task over traditional pretraining. Incorporating text into encoder pretraining is complementary to rescoring with a larger or in-domain language model, resulting in an additional 6% relative reduction in WER.
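The coupling of the two objectives described in the abstract can be sketched as a weighted sum of a contrastive term and a sequence term. The function name, the weighting scheme, and the example values below are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Hedged sketch of the joint pretraining objective: a wav2vec 2.0-style
# contrastive loss on all audio, plus a sequence loss that is only
# available when the audio was synthesized from (unspoken) text.
# The weight `seq_weight` and all values here are assumptions.

def joint_pretraining_loss(contrastive_loss: float,
                           sequence_loss: float,
                           seq_weight: float = 1.0) -> float:
    """Combine the self-supervised contrastive term with the
    sequence term derived from synthesized (TTS) speech."""
    return contrastive_loss + seq_weight * sequence_loss

# Untranscribed speech contributes only the contrastive term;
# TTS audio generated from unspoken text contributes both terms.
speech_only_loss = joint_pretraining_loss(2.5, 0.0)       # no text available
tts_batch_loss = joint_pretraining_loss(2.5, 1.8, 0.5)    # text-derived audio
```

In practice the sequence term would be a loss such as RNN-T or CTC computed against the known input text of the synthesized utterance; this sketch only shows how the two terms are coupled into one training objective.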