Generating diverse and natural text-to-speech samples using a quantized\n fine-grained VAE and auto-regressive prosody prior

Abstract

Recent neural text-to-speech (TTS) models with fine-grained latent features\nenable precise control of the prosody of synthesized speech. Such models\ntypically incorporate a fine-grained variational autoencoder (VAE) structure,\nextracting latent features at each input token (e.g., phonemes). However,\ngenerating samples with the standard VAE prior often results in unnatural and\ndiscontinuous speech, with dramatic prosodic variation between tokens. This\npaper proposes a sequential prior in a discrete latent space which can generate\nmore naturally sounding samples. This is accomplished by discretizing the\nlatent features using vector quantization (VQ), and separately training an\nautoregressive (AR) prior model over the result. We evaluate the approach using\nlistening tests, objective metrics of automatic speech recognition (ASR)\nperformance, and measurements of prosody attributes. Experimental results show\nthat the proposed model significantly improves the naturalness in random sample\ngeneration. Furthermore, initial experiments demonstrate that randomly sampling\nfrom the proposed model can be used as data augmentation to improve the ASR\nperformance.\n