Publication | Closed Access
Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data
Year: 2023
Citations: 24
References: 46
Unknown Venue
Engineering, Machine Learning, Whisper-style Training, Multilingual Pretraining, OpenAI Whisper, Speech Recognition, Natural Language Processing, Data Science, Computational Linguistics, Pre-training Speech Models, Speech Interface, Conversation Analysis, Automatic Recognition, Voice Recognition, Publicly Available Data, Machine Translation, Health Sciences, Speech Models, Speech Synthesis, Speech Output, Pre-trained Models, Computer Science, Speech Communication, Speech Technology, Speech Acoustics, Speech Processing, Human-computer Interaction, Speech Input, Open-source Toolkit, Speech Perception, Speech Translation, Supervised Speech Data
Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks, even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve their performance and to address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM also supports more translation directions than Whisper and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pretrained models and training logs, to promote open science.¹
¹ https://github.com/espnet/espnet