Recent advances in the automatic recognition of audiovisual speech

TLDR

Visual speech information from the speaker's mouth region improves the noise robustness of automatic speech recognizers, promising to extend their usability in human-computer interfaces. The paper reviews the main components of audiovisual ASR and presents contributions in two areas: a novel visual front-end design based on a cascade of linear image transforms of a mouth-region ROI, and audiovisual speech integration. On integration, the authors cover feature- and decision-level fusion, modeling of audiovisual asynchrony, and modality reliability weighting, and they evaluate these methods on three multisubject bimodal databases covering small- to large-vocabulary tasks in both visually controlled and visually challenging settings. Experiments show that the visual modality improves ASR performance across all conditions and datasets considered, though the gain is smaller in visually challenging environments and on large-vocabulary tasks.

Abstract

Visual speech information from the speaker's mouth region has been shown to improve the noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audiovisual automatic speech recognition (ASR) and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and second, audiovisual speech integration. On the latter topic, we discuss new work on combining feature and decision fusion, modeling audiovisual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audiovisual adaptation. We apply our algorithms to three multisubject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and visually challenging environments. Our experiments demonstrate that the visual modality improves ASR over all conditions and data considered, though less so for visually challenging environments and large-vocabulary tasks.
