Audio augmentation for speech recognition

TLDR

Data augmentation is a common strategy for increasing the quantity of training data, avoiding overfitting, and improving model robustness. This study investigates audio-level speech augmentation methods that operate directly on the raw signal. The authors recommend changing the audio speed to produce three versions of each signal (speed factors 0.9, 1.0, and 1.1) and evaluate the method on four LVCSR tasks with 100 to 1000 hours of training data, covering a range of data regimes. The technique is cheap to implement and easy to adopt, and it yields an average relative improvement of 4.3% across the four tasks.

Abstract

Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods which directly process the raw signal. The method we particularly recommend is to change the speed of the audio signal, producing 3 versions of the original signal with speed factors of 0.9, 1.0 and 1.1. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.
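
To make the three-version recipe concrete, here is a minimal sketch of speed perturbation by resampling. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which tool or interpolation method was used, so the linear interpolation below is a stand-in, and the names change_speed and augment are hypothetical. Resampling in this way changes the duration of the signal (and its pitch) when the result is played back at the original sampling rate.

```python
def change_speed(samples, factor):
    """Resample `samples` so that playback at the original sampling rate
    is `factor` times faster (factor < 1 slows down, factor > 1 speeds up).

    Assumption: plain linear interpolation, chosen only for illustration.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []

    last = len(samples) - 1
    n_out = max(1, round(len(samples) / factor))  # new duration in samples
    out = []
    for i in range(n_out):
        pos = min(i * factor, last)   # fractional position in the original
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Blend the two neighbouring original samples.
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def augment(samples, factors=(0.9, 1.0, 1.1)):
    """Return one perturbed copy of the signal per speed factor."""
    return {factor: change_speed(samples, factor) for factor in factors}
```

With this sketch, a 990-sample input yields copies of 1100, 990, and 900 samples for factors 0.9, 1.0, and 1.1 respectively, and the 1.0 copy equals the original, so the training set triples in size while the transcripts stay unchanged.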
