Details of the Nitech HMM-Based Speech Synthesis System for the Blizzard Challenge 2005

TLDR

The Blizzard Challenge 2005 was an open evaluation of corpus‑based text‑to‑speech systems using common speech datasets. This paper details the construction and performance of the Nitech‑HTS 2005 system, an HMM‑based synthesizer enhanced with STRAIGHT‑based vocoding, HSMM‑based acoustic modeling, and a speech parameter generation algorithm that considers GV. The voices generate speech at 0.3 × RT on a 1.6 GHz Pentium 4 machine with footprints under 2 Mbytes, and subjective listening tests showed naturalness and intelligibility much better than expected.

Abstract

In January 2005, an open evaluation of corpus-based text-to-speech synthesis systems using common speech datasets, named the Blizzard Challenge 2005, was conducted. The Nitech group participated in this challenge, entering an HMM-based speech synthesis system called Nitech-HTS 2005. This paper describes the technical details, building processes, and performance of our system. We first give an overview of the basic HMM-based speech synthesis system, and then describe the new features integrated into Nitech-HTS 2005: STRAIGHT-based vocoding, hidden semi-Markov model (HSMM) based acoustic modeling, and a speech parameter generation algorithm considering global variance (GV). The constructed Nitech-HTS 2005 voices can generate speech waveforms at 0.3 × RT (real-time ratio) on a 1.6 GHz Pentium 4 machine, and the footprints of these voices are less than 2 Mbytes. Subjective listening tests showed that the naturalness and intelligibility of the Nitech-HTS 2005 voices were much better than expected.
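
To make the 0.3 × RT figure concrete, the sketch below computes a real-time ratio under the common convention that it equals processing time divided by the duration of the generated speech. That convention is an assumption here, since the abstract does not spell out its definition. The function name and the timing values are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def real_time_ratio(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by the duration of the generated speech.

    Assumed convention: values below 1.0 mean synthesis is faster than playback.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timings: 3 s of computation to produce 10 s of speech.
ratio = real_time_ratio(3.0, 10.0)
print(f"{ratio:.1f} x RT")
```

Under this reading, a ratio of 0.3 means that a 10-second utterance takes about 3 seconds to generate on the stated hardware.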
