Converting English text to speech: a machine learning approach

Abstract

The task of mapping spelled English words into strings of phonemes and stresses (reading aloud) has many practical applications. Several commercial systems perform this task by applying a knowledge base of expert-supplied letter-to-sound rules. This dissertation presents a set of machine learning methods for automatically constructing letter-to-sound rules by analyzing a dictionary of words and their pronunciations. Taken together, these methods provide a substantial performance improvement over the best commercial system, DECtalk from Digital Equipment Corporation. In a performance test, the learning methods were trained on a dictionary of 19,002 words. Then, human subjects were asked to compare the performance of the resulting letter-to-sound rules against the dictionary for an additional 1,000 words not used during training. In a blind procedure, the subjects rated the pronunciations of both the learned rules and the DECtalk rules according to whether they were noticeably different from the dictionary pronunciation. The error rate for the learned rules was 28.8% (288 words noticeably different), while the error rate for the DECtalk rules was 44.3% (433 words noticeably different). If, instead of using human judges, we require that the pronunciations of the letter-to-sound rules exactly match the dictionary to be counted correct, then the error rate for our learned rules is 35.2% and the error rate for DECtalk is 63.6%. Similar results were observed at the level of individual letters, phonemes, and stresses. To achieve these results, several techniques were combined. The key learning technique represents the output classes by the codewords of an error-correcting code. Boolean concept learning methods, such as the standard ID3 decision-tree algorithm, can be applied to learn the individual bits of these codewords. This converts the multiclass learning problem into a number of boolean concept learning problems.
This method is shown to be superior to several other methods: multiclass ID3, one-tree-per-class ID3, the domain-specific distributed code employed by T. Sejnowski and C. Rosenberg in their NETtalk system, and a method developed by D. Wolpert. Similar results in the domain of isolated-letter speech recognition with the backpropagation algorithm show that error-correcting output codes provide a domain-independent, algorithm-independent approach to multiclass learning problems.
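
The error-correcting output code idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the four class labels, the 6-bit codebook, and the helper names hamming, bit_targets, and decode are all hypothetical. Each class is assigned a codeword, one boolean learner (for example an ID3 tree) would be trained per bit position using the targets from bit_targets, and a predicted bit string is decoded to the class with the nearest codeword.

```python
from typing import Dict, List, Sequence, Tuple

Codeword = Tuple[int, ...]

# Hypothetical codebook: four classes, 6-bit codewords. Every pair of
# codewords differs in 4 positions, so one wrong bit is still decodable.
CODEBOOK: Dict[str, Codeword] = {
    "k":  (0, 0, 0, 0, 0, 0),
    "s":  (1, 1, 1, 1, 0, 0),
    "ch": (1, 1, 0, 0, 1, 1),
    "-":  (0, 0, 1, 1, 1, 1),
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def bit_targets(labels: Sequence[str], bit: int) -> List[int]:
    """Boolean training targets for the learner assigned to one bit position."""
    return [CODEBOOK[label][bit] for label in labels]


def decode(bits: Sequence[int]) -> str:
    """Return the class whose codeword is nearest in Hamming distance."""
    return min(CODEBOOK, key=lambda label: hamming(CODEBOOK[label], bits))


if __name__ == "__main__":
    # Targets that the bit-0 learner would be trained on.
    print(bit_targets(["k", "s", "ch", "-"], 0))
    # One of the six bit learners is wrong here (second bit flipped from "s").
    print(decode((1, 0, 1, 1, 0, 0)))
```

Because the codewords are far apart, a mistake by any single bit learner leaves the predicted string closer to the correct codeword than to any other, which is the sense in which the code corrects errors made by the individual boolean learners.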