Three Methods of Intonation Modeling

Abstract

This paper compares three methods of generating intonation for an American English Text-to-Speech synthesis system. We look at a primarily rule-based approach and two data-driven approaches. For the data-driven modeling we used two separate data sets, each representing a somewhat different prosodic style. One database consisted of recordings of a portion of the 1989 Wall Street Journal text from the Penn Treebank Project. The second consisted of recordings of interactive prompts used in telephone network services. Both were read by the same female speaker. Approximately two and one-half hours of speech was phonetically and prosodically segmented and labeled (first automatically, then verified manually). The prosodic labeling used ToBI [7] tones and breaks. Three different intonation models were compared: (1) a predominantly rule-based model based on ToBI labels [3]; (2) a parametric model using the Tilt approach [8]; and (3) a Vector Quantized model based on an underlying parametric re...