Curriculum Pre-training for End-to-End Speech Translation

Abstract

End-to-end speech translation places a heavy burden on the encoder, which must transcribe, understand, and learn cross-lingual semantics simultaneously. To obtain a powerful encoder, traditional methods pre-train it on ASR data to capture speech features. However, we argue that pre-training the encoder through simple speech recognition alone is not enough, and that high-level linguistic knowledge should also be considered. Motivated by this, we propose a curriculum pre-training method that includes an elementary course for transcription learning and two advanced courses for understanding the utterance and mapping words between the two languages. The courses gradually increase in difficulty. Experiments show that our curriculum pre-training method leads to significant improvements on En-De and En-Fr speech translation benchmarks.
