Evaluation of the Vulnerability of Speaker Verification to Synthetic Speech

Abstract

In this paper, we evaluate the vulnerability of a speaker verification (SV) system to synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both SV and speech synthesis have renewed interest in it. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and a GMM-UBM-based SV system. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV system has a 0.4% equal error rate (EER). When the system is tested with synthetic speech generated from speaker models derived from the WSJ corpus, 90% of the matched claims are accepted. This result suggests a possible vulnerability in SV systems to synthetic speech. In order to detect synthetic speech prior to recognition, we investigate the use of an automatic speech recognizer (ASR), dynamic time warping (DTW) distance of mel-frequency cepstral coefficients (MFCC), and the previously proposed average inter-frame difference of log-likelihood (IFDLL). Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech can lead to an unacceptably high acceptance rate of synthetic speakers.
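
To make the two headline figures concrete, the following is a minimal sketch of how an EER operating point and a synthetic-speech acceptance rate could be computed from verification scores. It is an illustration under stated assumptions, not the evaluation code used in this work: the function names `equal_error_rate` and `acceptance_rate` are hypothetical, a claim is assumed to be accepted when its score is at or above the threshold, and the score lists are made up.

```python
def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) for two lists of verification scores.

    Assumption: a claim is accepted when its score is >= threshold.
    The EER is approximated at the candidate threshold where the false
    acceptance and false rejection rates are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)      # true speakers rejected
        far = sum(s >= t for s in impostor) / len(impostor)   # impostors accepted
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def acceptance_rate(scores, threshold):
    """Fraction of claims whose score meets the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Made-up scores for illustration only (not data from the paper).
genuine = [2.1, 1.8, 2.5, 0.4, 1.9]       # true-speaker claims
impostor = [-1.2, -0.7, 0.5, -2.0, -0.3]  # human impostor claims
synthetic = [1.7, 2.0, 0.1, 1.5, 2.2]     # matched synthetic-speech claims

eer, thr = equal_error_rate(genuine, impostor)
synthetic_accept = acceptance_rate(synthetic, thr)
print(f"EER = {eer:.2f} at threshold {thr:.2f}")
print(f"synthetic acceptance = {synthetic_accept:.2f}")
```

The point of the sketch is that the threshold is fixed using human speech only; synthetic claims are then scored against that same threshold, which is why a system with a low EER can still accept most synthetic claims.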
