Publication | Closed Access
Cepstral Vector Normalization Based on Stereo Data for Robust Speech Recognition
Citations: 36
References: 21
Year: 2007
Engineering, Machine Learning, Stereo Imaging, Robust Feature, Phoneme-dependent Memlin, Speech Recognition, Image Analysis, Speech Coding, Pattern Recognition, Robust Speech Recognition, Bias Vector, Polynomial Memlin, Voice Recognition, Health Sciences, Machine Vision, Deep Learning, Medical Image Computing, Distant Speech Recognition, Signal Processing, Computer Vision, Computer Stereo Vision, Speech Processing, Speaker Recognition, Speech Perception, Stereoscopic Processing, Stereo Data, Cepstral Vector Normalization
This paper presents a set of feature vector normalization methods based on the minimum mean square error (MMSE) criterion and stereo data. They include multi-environment model-based linear normalization (MEMLIN), polynomial MEMLIN (P-MEMLIN), multi-environment model-based histogram normalization (MEMHIN), and phoneme-dependent MEMLIN (PD-MEMLIN). These methods model the clean and noisy feature vector spaces using Gaussian mixture models (GMMs). Their objective is to learn a transformation between clean and noisy feature vectors associated with each pair of clean and noisy model Gaussians. The direct approach to learning the transformation is to use stereo data, that is, noisy feature vectors and the corresponding clean feature vectors. In this paper, however, a nonstereo data based training procedure is presented. The transformations can be modeled simply as a bias vector (MEMLIN), as a first-order polynomial (P-MEMLIN), or as a nonlinear function based on histogram equalization (MEMHIN). Further improvements are obtained by using phoneme-dependent bias vector transformations (PD-MEMLIN). In PD-MEMLIN, the clean and noisy feature vector spaces are partitioned by phoneme, and each partition is modeled as a GMM. These methods achieve significant word error rate improvements over other methods with similar goals. Experimental results on the SpeechDat Car database show an average improvement in word error rate greater than 68% in all cases compared to the baseline when using the original clean acoustic models, and up to 83% when training acoustic models on the new normalized feature space.
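As a rough illustration of the bias-vector idea described in the abstract, the sketch below estimates one bias vector per pair of clean and noisy Gaussians from stereo data and then subtracts a posterior-weighted bias from each noisy frame. It is a simplified sketch under stated assumptions, not the authors' implementation: the function names, the time-independent estimate of p(clean Gaussian | noisy Gaussian), and the assumption that GMM posteriors are already available as arrays are all hypothetical choices made for the example.

```python
import numpy as np

def estimate_pair_biases(clean, noisy, post_clean, post_noisy):
    """Estimate a bias vector for every (clean, noisy) Gaussian pair.

    clean, noisy: (T, D) stereo feature vectors (frame t is the same speech).
    post_clean:   (T, Kx) posteriors of the clean-GMM components given clean[t].
    post_noisy:   (T, Ky) posteriors of the noisy-GMM components given noisy[t].
    Returns bias (Kx, Ky, D) and cross (Kx, Ky), where cross[:, j] is an
    assumed time-independent estimate of p(clean Gaussian | noisy Gaussian j).
    """
    # Joint soft weight of each frame for each Gaussian pair: (T, Kx, Ky).
    w = post_clean[:, :, None] * post_noisy[:, None, :]
    diff = noisy - clean                          # (T, D) noisy minus clean
    num = np.einsum("tij,td->ijd", w, diff)       # weighted sum of differences
    den = w.sum(axis=0)                           # (Kx, Ky) total pair weight
    bias = num / np.maximum(den, 1e-12)[:, :, None]
    # Normalize pair weights over the clean index for each noisy Gaussian.
    cross = den / np.maximum(den.sum(axis=0, keepdims=True), 1e-12)
    return bias, cross

def normalize(noisy, post_noisy, bias, cross):
    """Subtract the expected bias from each noisy frame."""
    # Expected bias for each noisy Gaussian, averaged over clean Gaussians.
    per_noisy = np.einsum("ij,ijd->jd", cross, bias)   # (Ky, D)
    # Weight by the noisy-GMM posteriors of each frame and subtract.
    return noisy - post_noisy @ per_noisy
```

In this sketch the posteriors would come from GMMs trained separately on the clean and noisy feature spaces. If the noise were a constant offset, every pair bias would equal that offset and the normalized features would recover the clean ones exactly.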