Publication | Closed Access
Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory
Citations: 969 | References: 42 | Year: 2007
Voice Conversion, Spectral Conversion, Health Sciences, Voice, Engineering, Speech Synthesis, Robust Speech Recognition, Gaussian Mixture Model, Speech Output, Speech Processing, Voice Recognition, Speech Perception, Distant Speech Recognition, Signal Processing, Speech Communication, Speaker Recognition, Speech Recognition
Conventional voice conversion uses frame‑by‑frame MSE‑based spectral conversion, but this often yields inappropriate spectral movements and excessive smoothing, degrading speech quality. The study proposes a voice‑conversion method that uses maximum‑likelihood estimation of spectral parameter trajectories to overcome frame‑based conversion shortcomings. The method employs a Gaussian mixture model of joint source‑target features, incorporating both static and dynamic statistics and a global‑variance constraint to generate appropriate spectral trajectories. Experiments show that the proposed method markedly improves speech quality and speaker‑specific conversion accuracy compared to conventional approaches.
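The conventional frame-by-frame conversion mentioned above can be illustrated with a minimal NumPy sketch. It assumes a joint GMM over stacked source and target vectors whose parameters are already trained; the function names and the array layout (`weights`, `means`, `covs`, `dim_x`) are hypothetical and chosen only for this example, not taken from the paper.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at a single vector x."""
    d = x.shape[0]
    diff = x - mean
    solved = np.linalg.solve(cov, diff)
    log_det = np.linalg.slogdet(cov)[1]
    return np.exp(-0.5 * (diff @ solved + log_det + d * np.log(2.0 * np.pi)))

def mmse_convert_frame(x, weights, means, covs, dim_x):
    """Convert one source frame x with a joint GMM over [x; y].

    Assumed (hypothetical) layout:
      weights: (M,)       mixture weights
      means:   (M, D)     joint means, source part first (D = dim_x + dim_y)
      covs:    (M, D, D)  joint covariance matrices
    """
    n_mix = len(weights)
    # Posterior probability of each component given the source frame only.
    lik = np.array([
        weights[m] * gaussian_pdf(x, means[m, :dim_x], covs[m, :dim_x, :dim_x])
        for m in range(n_mix)
    ])
    post = lik / lik.sum()
    # Posterior-weighted sum of the per-component conditional means E[y | x, m].
    y = np.zeros(means.shape[1] - dim_x)
    for m in range(n_mix):
        mu_x, mu_y = means[m, :dim_x], means[m, dim_x:]
        s_xx = covs[m, :dim_x, :dim_x]
        s_yx = covs[m, dim_x:, :dim_x]
        y += post[m] * (mu_y + s_yx @ np.linalg.solve(s_xx, x - mu_x))
    return y
```

Each frame is converted independently of its neighbours, so nothing in this estimator constrains how the converted spectrum moves from one frame to the next, and averaging over components tends to flatten the output.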
In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, speech quality deteriorates for two reasons: 1) the frame-based conversion process does not always produce appropriate spectral movements, and 2) statistical modeling excessively smooths the converted spectra. To address these problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Both static and dynamic feature statistics are used to obtain an appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the proposed method dramatically improves VC performance in terms of both speech quality and conversion accuracy for speaker individuality.
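As a rough illustration of how static and dynamic statistics can jointly determine a trajectory, the sketch below solves the weighted least-squares problem that arises when each frame has a single Gaussian target for its static value and its delta. It is a simplification under stated assumptions: a one-dimensional feature, diagonal variances, a delta defined as 0.5 * (y[t+1] - y[t-1]) with clamped edges, and no mixture-component selection or global variance term. All names are hypothetical.

```python
import numpy as np

def build_window_matrix(n_frames):
    """Return W of shape (2T, T) so that W @ y interleaves [y_t, delta_t]."""
    W = np.zeros((2 * n_frames, n_frames))
    for t in range(n_frames):
        W[2 * t, t] = 1.0  # static row
        # Delta row: 0.5 * (y[t+1] - y[t-1]), indices clamped at the edges.
        prev_t, next_t = max(t - 1, 0), min(t + 1, n_frames - 1)
        W[2 * t + 1, prev_t] -= 0.5
        W[2 * t + 1, next_t] += 0.5
    return W

def ml_trajectory(means, variances):
    """Maximum-likelihood static trajectory for per-frame Gaussian targets.

    means, variances: arrays of shape (T, 2); column 0 is the static
    target, column 1 the delta target (assumed layout).
    Solves (W^T P W) y = W^T P m, where P is the diagonal precision matrix.
    """
    n_frames = means.shape[0]
    W = build_window_matrix(n_frames)
    prec = 1.0 / variances.reshape(-1)  # ordering matches the rows of W
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * means.reshape(-1))
    return np.linalg.solve(A, b)
```

Because the delta rows couple neighbouring frames, the solution is a whole sequence rather than a set of independent per-frame estimates. A small delta variance pulls the trajectory toward the target slope, while a small static variance pins it to the target value.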