Publication | Closed Access
Large-vocabulary audio-visual speech recognition: a summary of the Johns Hopkins Summer 2000 Workshop
Citations: 92
References: 8
Year: 2001
Venue: Unknown
Topics: Engineering, Machine Learning, Spoken Language Processing, Speech Recognition, Image Analysis, Data Science, Pattern Recognition, Robust Speech Recognition, Voice Recognition, Health Sciences, Noisy Speech, Computer Science, Deep Learning, Distant Speech Recognition, Speech Communication, Computer Vision, DCT Visual Features, Multi-speaker Speech Recognition, Speech Processing, Speech Input, Speech Perception, Linguistics, Visual Feature Extraction
We report a summary of the Johns Hopkins Summer 2000 Workshop on audio-visual automatic speech recognition (ASR) in the large-vocabulary, continuous speech domain. Two problems of audio-visual ASR were mainly addressed: visual feature extraction and audio-visual information fusion. First, image transform and model-based visual features were considered, obtained by means of the discrete cosine transform (DCT) and active appearance models, respectively. The former were demonstrated to yield superior automatic speech reading. Subsequently, a number of feature fusion and decision fusion techniques for combining the DCT visual features with traditional acoustic ones were implemented and compared. Hierarchical discriminant feature fusion and asynchronous decision fusion by means of the multi-stream hidden Markov model consistently improved ASR for both clean and noisy speech. Compared to an equivalent audio-only recognizer, introducing the visual modality reduced ASR word error rate by 7% relative in clean speech, and by 27% relative at an 8.5 dB SNR audio condition.
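The two quantitative ideas in the abstract, stream-weighted decision fusion and relative word error rate (WER) reduction, can be illustrated with a minimal Python sketch. It is not the workshop's implementation. It assumes the common multi-stream formulation in which a state's score is a weighted sum of the audio and visual log-likelihoods, with the two weights summing to one. The function names, the weight value, and all numeric inputs are hypothetical. The abstract reports only relative reductions, so the absolute WER values below are invented purely to show the arithmetic.

```python
import math


def fused_log_likelihood(log_p_audio, log_p_visual, audio_weight):
    """Combine per-stream log-likelihoods with stream weights.

    Assumption: the weights are non-negative and sum to 1, so the visual
    stream receives (1 - audio_weight).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_p_audio + (1.0 - audio_weight) * log_p_visual


def relative_wer_reduction(wer_baseline, wer_new):
    """Relative WER reduction, expressed as a fraction of the baseline WER."""
    if wer_baseline <= 0.0:
        raise ValueError("wer_baseline must be positive")
    return (wer_baseline - wer_new) / wer_baseline


# Hypothetical log-likelihoods for one state under each stream.
score = fused_log_likelihood(log_p_audio=-2.0, log_p_visual=-4.0, audio_weight=0.5)

# Hypothetical absolute WERs (percent), chosen only so that the ratio
# reproduces a 27% relative reduction.
reduction = relative_wer_reduction(wer_baseline=20.0, wer_new=14.6)

assert math.isclose(score, -3.0)
assert math.isclose(reduction, 0.27)
```

Under this formulation, lowering `audio_weight` shifts reliance toward the visual stream, which is the intuition behind larger gains in noisy audio than in clean speech.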