
Publication | Closed Access

Speaker adaptation of neural network acoustic models using i-vectors

Citations: 605
References: 15
Year: 2013

TLDR

The study proposes adapting DNN acoustic models to a target speaker by feeding speaker identity vectors (i-vectors) into the network alongside the acoustic features for ASR. During both training and testing, each speaker's i-vector is concatenated to every frame belonging to that speaker, providing speaker-specific conditioning to the DNN. Experiments on a 300-hour Switchboard corpus show that DNNs conditioned on i-vectors achieve a 10% relative WER reduction over speaker-independent models, match the performance of VTLN/FMLLR-adapted DNNs while needing only a single decoding pass, and improve by a further 5-6% relative after Hessian-free sequence training when combined with speaker-adapted features.

Abstract

We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a 300-hour Switchboard corpus show that DNNs trained on speaker-independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker-independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR), with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after Hessian-free sequence training over networks trained on speaker-adapted features only.
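The core input construction described in the abstract can be sketched in a few lines: tile one speaker's i-vector across that speaker's frames and concatenate it to the acoustic features. This is an illustrative sketch only; the dimensions (40-dim features, 100-dim i-vector) and the function name `append_ivector` are assumptions, not the paper's exact configuration.

```python
import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Concatenate one speaker's i-vector to every acoustic frame.

    frames:  (T, D_acoustic) feature matrix for one speaker
    ivector: (D_ivec,) speaker identity vector, constant across
             that speaker's frames
    returns: (T, D_acoustic + D_ivec) network input
    """
    # Repeat the same i-vector once per frame, then join along
    # the feature dimension.
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

# Hypothetical example: 5 frames of 40-dim features, a 100-dim i-vector.
x = append_ivector(np.zeros((5, 40)), np.ones(100))
print(x.shape)  # (5, 140)
```

At test time the same construction is applied with the test speaker's i-vector, which is why a single decoding pass suffices: no model parameters are re-estimated per speaker.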
