Publication | Closed Access
Hierarchical discriminant features for audio-visual LVCSR
Citations: 75
References: 10
Year: 2001
Venue: unknown
Engineering, Biometrics, Speech Recognition, Second Stage, Image Analysis, Pattern Recognition, Audio Signal Processing, Audio-visual Features, Audio Analysis, Robust Speech Recognition, Voice Recognition, Health Sciences, Linear Discriminant Analysis, Hierarchical Discriminant Features, Audio Retrieval, Deep Learning, Distant Speech Recognition, Speech Communication, Computer Vision, Multi-speaker Speech Recognition, Speech Processing, Speech Input, Speech Perception
We propose a hierarchical, two-stage discriminant transformation for obtaining audio-visual features that improve automatic speech recognition. In the first stage, linear discriminant analysis (LDA) followed by a maximum likelihood linear transform (MLLT) is applied separately to MFCC-based audio-only features and to visual-only features, the latter obtained by a discrete cosine transform of the video region of interest. In the second stage, LDA and MLLT are applied again to the concatenation of the resulting single-modality features. The resulting audio-visual features are used to train a traditional HMM-based speech recognizer. Experiments on the IBM ViaVoice™ audio-visual database demonstrate that the proposed feature fusion method improves speaker-independent, large-vocabulary continuous speech recognition (LVCSR) in both the clean and the noisy audio conditions considered. In the noisy condition, it achieves a 24% relative word error rate reduction over an audio-only system.
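The two-stage structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses synthetic data in place of MFCC and DCT features, the feature dimensions (24, 30, 8, 6, 9) and the ten integer class labels are arbitrary assumptions, and the MLLT step that follows each LDA in the paper is omitted for brevity. The helper names `lda_fit` and `toy_features` are hypothetical.

```python
import numpy as np

def lda_fit(X, y, out_dim, reg=1e-6):
    """Return a (d, out_dim) LDA projection for features X with class labels y."""
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    Sw += reg * np.eye(d)  # small ridge keeps Sw positive definite
    # Solve Sb w = lambda Sw w by whitening with the Cholesky factor of Sw.
    L = np.linalg.cholesky(Sw)
    Linv = np.linalg.inv(L)
    evals, V = np.linalg.eigh(Linv @ Sb @ Linv.T)
    order = np.argsort(evals)[::-1][:out_dim]  # largest eigenvalues first
    return Linv.T @ V[:, order]

def toy_features(rng, y, n_classes, dim):
    """Synthetic stand-in for real features: class mean plus Gaussian noise."""
    means = rng.normal(size=(n_classes, dim))
    return means[y] + rng.normal(size=(len(y), dim))

rng = np.random.default_rng(0)
n_classes = 10                                # assumed number of classes
y = np.repeat(np.arange(n_classes), 100)      # 100 frames per class
audio = toy_features(rng, y, n_classes, 24)   # stand-in for MFCC-based features
video = toy_features(rng, y, n_classes, 30)   # stand-in for DCT-based features

# Stage 1: a separate LDA per modality (MLLT omitted in this sketch).
W_a = lda_fit(audio, y, 8)
W_v = lda_fit(video, y, 6)
stage1 = np.hstack([audio @ W_a, video @ W_v])

# Stage 2: LDA on the concatenated single-modality features.
W_av = lda_fit(stage1, y, 9)
fused = stage1 @ W_av
```

Output dimensions are kept at or below `n_classes - 1` because the between-class scatter matrix has at most that rank, so further LDA directions would carry no discriminative information. The `fused` array plays the role of the audio-visual feature stream that would be passed to the recognizer.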