Generative Adversarial Network-Based Glottal Waveform Model for Statistical Parametric Speech Synthesis

Abstract

Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal vocoding. This refers to vocoders that parameterize speech into two parts, the glottal excitation and vocal tract, that occur in the human speech production apparatus. Current glottal vocoders generate the glottal excitation waveform by using deep neural networks (DNNs). However, the squared error-based training of the present glottal excitation models is limited to generating conditional average waveforms, which fails to capture the stochastic variation of the waveforms. As a result, shaped noise is added as post-processing. In this study, we propose a new method for predicting glottal waveforms by generative adversarial networks (GANs). GANs are generative models that aim to embed the data distribution in a latent space, enabling generation of new instances very similar to the original by randomly sampling the latent distribution. The glottal pulses generated by GANs show a stochastic component similar to natural glottal pulses. In our experiments, we compare synthetic speech generated using glottal waveforms produced by both DNNs and GANs. The results show that the newly proposed GANs achieve synthesis quality comparable to that of widely-used DNNs, without using an additive noise component.
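
The limitation of squared error-based training noted in the abstract can be illustrated with a small numerical sketch. It is not the paper's model: the pulse shape, noise level, and function names below are hypothetical, chosen only to show that the waveform minimizing squared error against many noisy pulses is their average, in which the stochastic component is almost entirely cancelled.

```python
import numpy as np


def squared_error_optimum(pulses):
    """Return the single waveform that minimizes mean squared error
    to all rows of `pulses`; this is their sample mean."""
    return pulses.mean(axis=0)


def demo(n_pulses=2000, length=400, noise_std=0.05, seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, length, endpoint=False)
    # Hypothetical deterministic pulse shape (an assumption, not a glottal model).
    template = np.sin(np.pi * t) ** 2
    # Each stand-in "natural" pulse is the template plus independent noise.
    pulses = template + noise_std * rng.standard_normal((n_pulses, length))
    average = squared_error_optimum(pulses)
    # Power of the stochastic part in individual pulses vs. in the average.
    natural_noise_power = np.mean((pulses - template) ** 2)
    average_noise_power = np.mean((average - template) ** 2)
    return natural_noise_power, average_noise_power


natural_noise_power, average_noise_power = demo()
print(natural_noise_power, average_noise_power)
```

In this toy setting the noise power of the averaged waveform is roughly `n_pulses` times smaller than that of the individual pulses, which is why a model trained this way needs noise added afterwards, whereas a generator that samples from a latent distribution can produce the variation directly.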
