Audio-visual speech recognition

Abstract

We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, for ASR to approach human levels of performance and for speech to become a truly pervasive user interface, we need novel, nontraditional approaches that have the potential of yielding dramatic ASR improvements. Visual speech is one such source for making large improvements in high noise environments, with the potential of channel and task independence. It is not affected by the acoustic environment and noise, and it possibly contains the greatest amount of complementary information to the acoustic signal. In this workshop, our goal was to advance the state of the art in ASR by demonstrating the use of visual information in addition to the traditional audio for large vocabulary continuous speech recognition (LVCSR). Starting with an appropriate audio-visual database, collected and provided by IBM, we demonstrated for the first time that LVCSR performance can be improved by the use of visual information in the clean audio case. Specifically, by conducting audio lattice rescoring experiments, we showed a 7% relative word error rate (WER) reduction in that condition. Furthermore, for the harder problem of speech contaminated by speech "babble" noise at 10 dB SNR, we demonstrated that recognition performance can be improved by 27% in relative WER reduction, compared to an equivalent audio-only recognizer matched to the noise environment. We believe that this paves the way to seriously address the challenge of speech recognition in high noise environments and to potentially achieve human levels of performance. In this report, we detail a number of approaches and experiments conducted during the summer workshop in the areas of visual feature extraction, hidden Markov model based visual-only recognition, and audio-visual information fusion. The latter was our main concentration: in the workshop, a number of feature fusion as well as decision fusion techniques for audio-visual ASR were explored and compared.