ACOUSTIC MODELING BASED ON THE MDL PRINCIPLE FOR SPEECH RECOGNITION

Koichi Shinoda and Takao Watanabe
NEC Corporation
4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216, JAPAN
{shinoda,watanabe}@hum.cl.nec.co.jp

ABSTRACT

Recently, context-dependent phone units, such as triphones, have been used to model subword units in speech recognition based on Hidden Markov Models (HMMs). While most such methods employ clustering of the HMM parameters (e.g., subword clustering, state clustering, etc.) to control HMM size so as to avoid poor recognition accuracy due to an insufficiency of training data, none of them provide any effective criterion for the optimal degree of clustering that should be performed. This paper proposes a method in which state clustering is accomplished by way of phonetic decision trees and in which the MDL criterion is used to optimize the degree of clustering. Large-vocabulary Japanese recognition experiments show that the models obtained by this method achieved the highest accuracy among models of various sizes obtained with conventional clustering approaches.

1. INTRODUCTION

Over the past few years, extensive studies have been carried out on speaker-independent speech recognition using continuous density Hidden Markov Models (HMMs). It is well known that in most such systems, the use of context-dependent (CD) phone models instead of context-independent (CI) phone models (monophones) improves recognition accuracy [1-7].

Since the number of CD models is usually much larger than that of CI models, using CD models better captures variations in speech data. However, the amount of available training data is likely to be insufficient to support the use of such a large number of CD models, and it is often impractical to prepare a sufficiently large amount of data.
Furthermore, the frequency with which a CD phone appears in training data usually differs substantially across the set of CD phones; in most cases, the frequencies for some CD phones are so small that those CD phones do not appear in the training data even when a large amount of data is provided. This data insufficiency often causes serious degradation in speech recognition performance. Most recognition systems using CD models employ clustering of model parameters to alleviate part of the problem.

Various clustering methods have been developed for this purpose. First, there are several choices for the units to which clustering is applied; K. F. Lee et al. [1], for example, use subword clustering, Hwang et al. [2] use state clustering, and Digalakis et al. [3] cluster the mixture components of HMMs with Gaussian-mixture state observation densities. Second, there are several methods for selecting the acoustically similar units to be clustered. Some methods use only the acoustic characteristics of the data, and the merging of units is carried out in a bottom-up manner [4, 2, 3]. Other methods, in addition, utilize a priori knowledge about acoustic similarities between the units, mostly represented by decision trees [1, 5, 6, 7]. In most of the latter methods, the units of CI models are split in a top-down manner, instead of merging the units of CD models.

In these clustering methods, it is important to properly measure the acoustic similarities between the units using the training data, in order to select the units to be clustered. One of the most successful approaches is that based on the maximum-likelihood (ML) criterion (e.g., [7]). In the following, for simplicity, only the splitting method (top-down clustering) is explained, though a similar explanation also applies to the merging method (bottom-up clustering).
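The top-down splitting step can be sketched in a few lines of code. The following is a simplified, hypothetical illustration, not the paper's implementation: it uses 1-D single-Gaussian state models instead of the multivariate HMM state densities, and the phonetic question names are invented. Each yes/no question partitions a state's training samples, and the question yielding the largest likelihood increase is chosen.

```python
import math

def gauss_loglik(samples):
    """ML log-likelihood of 1-D samples under a single Gaussian
    fitted to those same samples: -n/2 * (log(2*pi*var) + 1)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    var = max(var, 1e-6)  # variance floor to avoid log(0)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def split_gain(samples, answers):
    """Likelihood increase from splitting samples by a yes/no
    phonetic question (answers[i] is True/False for samples[i])."""
    yes = [x for x, a in zip(samples, answers) if a]
    no = [x for x, a in zip(samples, answers) if not a]
    if not yes or not no:
        return float("-inf")  # degenerate split: reject
    return gauss_loglik(yes) + gauss_loglik(no) - gauss_loglik(samples)

def best_question(samples, questions):
    """questions: dict name -> answer list. Return the (name, gain)
    of the question with the largest likelihood increase."""
    return max(((q, split_gain(samples, a)) for q, a in questions.items()),
               key=lambda t: t[1])
```

In a real system the gain would be computed from sufficient statistics of multivariate Gaussians accumulated per HMM state, but the greedy argmax selection logic is the same.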
In this approach, the increase in the likelihood obtained by splitting is calculated for each unit in the unit set, and the unit with the largest increase is selected and split.

However, this ML approach has one drawback. In most cases, the likelihood grows as the number of units grows; in the final stage of splitting, the model set becomes almost identical to the set of CD models without clustering. Therefore, this approach requires an external parameter to control the degree of clustering. Most methods limit splitting using a threshold on the increase in the likelihood or on the number of units. These thresholds need to be optimized through a series of recognition experiments using test data or by a cross-validation method. Such optimization processes are computationally expensive, require more data, and have no strong theoretical justification.

In this paper we propose a new approach in which a minimum description length (MDL) criterion, instead of the ML criterion, is used for clustering. The MDL approach [9] is based on an information-theoretic criterion, which has been used for selecting a probabilistic model with an appropriate complexity for the given amount of data. The MDL criterion is effective not only for selecting the units to be split, but also for deciding whether to stop splitting. Therefore, no other external parameter is needed to control the degree of clustering. We apply this criterion to state splitting using phonetic decision trees.

2. MDL CRITERION

MDL [9] is an information criterion which has been proven effective in selecting the optimal model from among various probabilistic models. The MDL criterion selects, as the optimal model, the model with the minimum description length for the given data from among a set of models.
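The practical difference from the ML threshold approach can be illustrated with a toy sketch before the criterion is formalized: instead of a hand-tuned threshold on the likelihood gain, a split is accepted only when it lowers an MDL-style description length. This is a simplified, hypothetical illustration, not the paper's exact procedure: the per-split parameter increase and frame count are assumed inputs, and the model-index code-length term (constant across candidates at each step) is ignored.

```python
import math

def mdl_split_decision(loglik_gain, extra_params, n_frames):
    """Accept a split only if it decreases the description length,
    i.e. the likelihood gain exceeds the parameter-cost penalty
    (extra_params / 2) * log(N)."""
    return loglik_gain > 0.5 * extra_params * math.log(n_frames)

def greedy_mdl_split(candidates, extra_params, n_frames):
    """candidates: list of (name, loglik_gain) for possible splits.
    Greedily apply the best remaining split while the MDL criterion
    still decreases; stopping needs no external threshold."""
    chosen = []
    for name, gain in sorted(candidates, key=lambda t: -t[1]):
        if mdl_split_decision(gain, extra_params, n_frames):
            chosen.append(name)
        else:
            break  # best remaining split no longer pays for itself
    return chosen
```

Note that the stopping point falls out of the criterion itself: with N = 1000 frames and 2 extra parameters per split, any split gaining less than log(1000) in log-likelihood is rejected automatically.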
When a set of models {1, ..., i, ..., I} is given, the description length l_i(x^N) of the data x^N = {x_1, ..., x_N} together with an underlying model i is given by

    l_i(x^N) = -log P_{θ̂^(i)}(x^N) + (α_i / 2) log N + log I,    (1)

where α_i is the dimensionality (the number of free parameters) of model i, and θ̂^(i) is the maximum likelihood estimate of the parameters θ^(i) = (θ_1^(i), ..., θ_{α_i}^(i)) of model i. The first term in (1) is the code length for the data x^N when model i is used as a probabilistic model. This term
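Equation (1) can be evaluated directly to compare candidate model sets. The sketch below is an illustration with invented names and numbers: `neg_loglik` stands for the already-computed value of -log P_{θ̂^(i)}(x^N), and each candidate is scored by its description length, with the minimizer selected.

```python
import math

def description_length(neg_loglik, alpha, n, num_models):
    """Equation (1): l_i(x^N) = -log P(x^N) + (alpha_i/2) log N + log I."""
    return neg_loglik + 0.5 * alpha * math.log(n) + math.log(num_models)

def select_model(models, n):
    """models: list of (name, neg_loglik, alpha) candidates.
    Return the name of the model with the minimum description
    length for N data points."""
    num_models = len(models)
    return min(models,
               key=lambda m: description_length(m[1], m[2], n, num_models))[0]
```

The (α_i / 2) log N term penalizes model complexity, so a model with a slightly worse data fit but far fewer free parameters can still win; this is exactly the trade-off that makes the criterion usable as a stopping rule for decision-tree splitting.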
