Unit selection in a concatenative speech synthesis system using a large speech database

TLDR

Unit selection from a large speech database is a common approach to generating natural-sounding synthesized speech. Phoneme units are chosen to match a target phoneme sequence predicted from text annotated with prosodic and phonetic context. The study models the synthesis database as a state transition network in which the state occupancy cost is the distance between a database unit and the target, and the transition cost estimates the quality of concatenating two consecutive units, a framework that resembles HMM-based speech recognition. A pruned Viterbi search selects the best units for synthesis. The approach permits training from natural speech, and two training methods are presented that yield weights producing more natural speech than hand-tuning.

Abstract

One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights which produce more natural speech than can be obtained by hand-tuning.
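The search described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Unit` features (pitch and duration), the absolute-difference cost functions, the weights `w_target` and `w_join`, and the use of a fixed-width beam as the pruning rule. The paper's actual features, distance measures, and pruning strategy are not given in the abstract.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Unit:
    """A database unit or a target specification (hypothetical features)."""
    phoneme: str
    pitch: float     # assumed feature, e.g. mean F0 in Hz
    duration: float  # assumed feature, in seconds


def target_cost(target: Unit, cand: Unit) -> float:
    # State occupancy cost: distance between a database unit and the target.
    return abs(target.pitch - cand.pitch) + abs(target.duration - cand.duration)


def join_cost(prev: Unit, nxt: Unit) -> float:
    # Transition cost: a stand-in estimate of concatenation quality.
    return abs(prev.pitch - nxt.pitch)


def select_units(
    targets: Sequence[Unit],
    database: Sequence[Unit],
    beam: int = 10,
    w_target: float = 1.0,
    w_join: float = 1.0,
) -> Tuple[float, List[Unit]]:
    """Viterbi search with beam pruning; returns (total cost, chosen units)."""
    hyps: List[Tuple[float, List[Unit]]] = []
    for i, t in enumerate(targets):
        cands = [u for u in database if u.phoneme == t.phoneme]
        if not cands:
            raise ValueError(f"no database unit for phoneme {t.phoneme!r}")
        new_hyps = []
        for c in cands:
            tc = w_target * target_cost(t, c)
            if i == 0:
                new_hyps.append((tc, [c]))
                continue
            # Viterbi step: keep only the cheapest predecessor for this unit.
            cost, path = min(
                ((pc + w_join * join_cost(pp[-1], c), pp) for pc, pp in hyps),
                key=lambda item: item[0],
            )
            new_hyps.append((cost + tc, path + [c]))
        # Pruning: keep only the `beam` cheapest partial paths.
        new_hyps.sort(key=lambda item: item[0])
        hyps = new_hyps[:beam]
    if not hyps:
        raise ValueError("empty target sequence")
    return hyps[0]
```

Calling `select_units(targets, database)` returns the lowest-cost path found and its total cost. In this sketch, `w_target` and `w_join` play the role of the weights that the abstract says can be trained from natural speech rather than hand-tuned; the training methods themselves are not shown.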
