Extending Parrotron: An End-to-End, Speech Conversion and Speech Recognition Model for Atypical Speech

TLDR

The study introduces an extended Parrotron model that performs voice conversion and speech recognition simultaneously, and evaluates it on atypical speech. The end-to-end network transforms input spectrograms into spectrograms in the voice of a predetermined target speaker while also generating text hypotheses. With as little as an hour of atypical speech, speaker adaptation yields a 77% relative reduction in Word Error Rate (WER), measured by ASR performance on the converted speech. Data augmentation using a customized synthesizer built on atypical speech adds a further 10% relative improvement over the best speaker-adapted model, and the methods generalize across eight types of atypical speech.

Abstract

We present an extended Parrotron model: a single, end-to-end network that enables voice conversion and recognition simultaneously. Input spectrograms are transformed to output spectrograms in the voice of a predetermined target speaker while also generating hypotheses in a target vocabulary. We study the performance of this novel architecture, which jointly predicts speech and text, on atypical (e.g. dysarthric) speech. We show that with as little as an hour of atypical speech, speaker adaptation can yield a 77% relative reduction in Word Error Rate (WER), measured by ASR performance on the converted speech. We also show that data augmentation using a customized synthesizer built on atypical speech can provide an additional 10% relative improvement over the best speaker-adapted model. Finally, we show how these methods generalize across 8 types of atypical speech for a range of speech impairment severities.
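The reported gains are relative, not absolute, WER reductions. As a minimal sketch of that arithmetic, the Python below computes WER with the standard word-level edit-distance definition and then the relative reduction between two WER values. The function names `word_error_rate` and `relative_reduction` are hypothetical, and the abstract does not specify the scoring tool actually used, so this illustrates the metric only, not the authors' evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumes the standard definition (substitutions, insertions and
    deletions all cost 1); the paper's exact scoring setup is not given.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline WER that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer
```

With purely hypothetical numbers, a baseline WER of 0.5 that drops to 0.25 after adaptation is a relative reduction of 0.5 (50%), even though the absolute drop is only 25 points. A 77% relative reduction therefore means the adapted model's WER is 23% of the baseline's, whatever the baseline was.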
