High-resolution voice transformation

Abstract

Speaker identity, the sound of a person's voice, plays an important role in human communication. As speech systems become increasingly ubiquitous, Voice Transformation (VT), a technology that modifies a source speaker's speech utterance to sound as if a target speaker had spoken it, offers a number of useful applications. For example, a novice user can adapt a text-to-speech system to speak with a new voice quickly and inexpensively. In this dissertation, we consider new approaches in both the design and the evaluation of VT techniques. We propose a new type of speech corpus that is especially suited to VT research and development because it consists of naturally time-aligned sentences. Consequently, removal of individual prosodic characteristics, such as fundamental frequency (pitch) and durations, requires very little processing and results in high-quality speech samples that differ only in their segmental properties, our focus of transformation. These “prosody-normalized” speech samples are used for training VT systems, as well as for evaluating their transformation performance objectively and subjectively. Our baseline transformation system (SET) is based on transforming the spectral envelope as represented by the LPC spectrum, using a harmonic sinusoidal model for analysis and synthesis. The transformation function is implemented as a regressive, joint-density Gaussian mixture model, trained on aligned LSF vectors by an expectation-maximization algorithm. We improve upon the baseline by adding a residual prediction module, which predicts target LPC residuals from transformed LPC spectral envelopes, using a classifier and residual codebooks. The resulting high-resolution transformation system (HRT) is capable of rendering transformed speech with a high degree of spectral detail. Because of the severe shortcomings of evaluating VT performance objectively, we propose a subjective evaluation strategy, consisting of several listening tests.
In a speaker discrimination test, the HRT system performed significantly better than the SET system. However, discrimination was below that of natural utterances. Similarly, listeners selected the HRT system over other systems in a system comparison test. Finally, listeners rated the speech quality of the HRT system as better than that of the SET system. However, the quality of natural utterances was considered better than that of transformed speech.
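
The regressive, joint-density Gaussian mixture model mentioned above can be illustrated with a minimal sketch of the conversion step only. It is not the dissertation's implementation. It assumes the mixture parameters have already been estimated (for example by expectation-maximization on aligned source and target LSF vectors), and it simplifies each covariance block to its diagonal, which is an assumption made here for brevity. The names `convert`, `gaussian_diag_logpdf`, and the dictionary keys are hypothetical, and the parameter values are toy numbers rather than trained ones.

```python
import math


def gaussian_diag_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def convert(x, components):
    """Map a source feature vector x to a predicted target vector.

    Each component is a dict with hypothetical keys:
      weight     mixture weight
      mu_x, mu_y source and target means
      var_x      source variances (diagonal of the source covariance)
      cov_yx     per-dimension cross-covariances (diagonal assumption)
    Source and target vectors are assumed to have the same dimension.
    """
    # Posterior probability of each component given the source vector only.
    log_post = [
        math.log(c["weight"]) + gaussian_diag_logpdf(x, c["mu_x"], c["var_x"])
        for c in components
    ]
    peak = max(log_post)  # subtract the maximum for numerical stability
    post = [math.exp(lp - peak) for lp in log_post]
    total = sum(post)
    post = [p / total for p in post]

    # Posterior-weighted sum of the per-component linear regressions.
    y = [0.0] * len(components[0]["mu_y"])
    for p, c in zip(post, components):
        for d in range(len(y)):
            slope = c["cov_yx"][d] / c["var_x"][d]
            y[d] += p * (c["mu_y"][d] + slope * (x[d] - c["mu_x"][d]))
    return y


# Toy two-component model over 2-dimensional features (illustrative values).
toy_model = [
    {"weight": 0.5, "mu_x": [0.2, 0.8], "mu_y": [0.3, 0.9],
     "var_x": [0.01, 0.02], "cov_yx": [0.008, 0.015]},
    {"weight": 0.5, "mu_x": [0.6, 1.4], "mu_y": [0.5, 1.5],
     "var_x": [0.02, 0.03], "cov_yx": [0.012, 0.020]},
]
predicted = convert([0.25, 0.85], toy_model)
```

With full covariance matrices, the per-dimension slope would be replaced by the matrix product of the cross-covariance and the inverse source covariance; the diagonal form is used here only to keep the sketch dependency-free.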