Publication | Closed Access
WhisperX: Time-Accurate Speech Transcription of Long-Form Audio
Citations: 192
References: 22
Year: 2023
Unknown Venue
Speech Sciences, Engineering, Batch Input, Speech Recognition, Audio Signal Analysis, Automatic Recognition, Voice Recognition, Real-time Language, Speech Signal Analysis, Health Sciences, Efficient Speech Transcription, Speech Synthesis, Speech Output, Computer Science, Text-to-speech, Signal Processing, Speech Communication, Voice, Speech Acoustics, Time-accurate Speech Transcription, Speech Processing, Speech Input, Active Speech Regions, Speech Perception
Figure 1: WhisperX: We present a system for efficient speech transcription of long-form audio with word-level time alignment. The input audio is first segmented with Voice Activity Detection and then cut & merged into approximately 30-second input chunks with boundaries that lie on minimally active speech regions. The resulting chunks are then: (i) transcribed in parallel with Whisper and (ii) force-aligned with a phone recognition model to produce accurate word-level timestamps at high throughput.
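The cut & merge step described in the caption can be illustrated with a minimal sketch. It assumes the Voice Activity Detection stage has already produced a list of speech segments as (start, end) times in seconds. The names `cut_long` and `cut_and_merge` are hypothetical and not taken from the WhisperX code. One simplification is labeled in the code: over-long segments are split evenly, whereas the caption describes boundaries placed on minimally active speech regions, which would require per-frame VAD scores not available here.

```python
import math
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def cut_long(segment: Segment, max_len: float) -> List[Segment]:
    """Split one segment into equal pieces no longer than max_len.

    Simplification (assumption): pieces are of equal length. The system in
    the caption instead cuts where speech activity is minimal.
    """
    start, end = segment
    duration = end - start
    if duration <= max_len:
        return [segment]
    n_pieces = math.ceil(duration / max_len)
    step = duration / n_pieces
    # Build boundaries explicitly so the last piece ends exactly at `end`.
    bounds = [start + i * step for i in range(n_pieces)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def cut_and_merge(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Cut over-long segments, then greedily merge neighbours up to max_len."""
    pieces: List[Segment] = []
    for seg in sorted(segments):
        pieces.extend(cut_long(seg, max_len))

    chunks: List[Segment] = []
    for start, end in pieces:
        # Merge into the previous chunk if the combined span still fits.
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


vad_segments = [(0.0, 10.0), (12.0, 25.0), (27.0, 40.0), (50.0, 120.0)]
chunks = cut_and_merge(vad_segments)
for start, end in chunks:
    print(f"{start:7.2f} -> {end:7.2f}  ({end - start:5.2f}s)")
```

Each resulting chunk spans at most 30 seconds, so the chunks can be transcribed independently and batched. A merged chunk includes any silence between the segments it joins.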