Publication | Closed Access
Extraction of visual features for lipreading
Citations: 532
References: 62
Year: 2002
Topics: Engineering, Feature Detection, Biometrics, Feature Extraction, Lip Deformations, Visual Speech Information, Speech Recognition, Image Analysis, Data Science, Pattern Recognition, Phonetics, Visual Features, Robust Speech Recognition, Biostatistics, Voice Recognition, Health Sciences, Machine Vision, Computer Science, Distant Speech Recognition, Speech Communication, Computer Vision, Speech Technology, Multimodal Nature, Speech Processing, Speech Input, Speech Perception
The multimodal nature of speech is often ignored in human‑computer interaction, yet lip deformations and other body motions such as head movements convey additional information. The paper demonstrates how this complementary visual speech information can be leveraged for speech recognition. The study compares three HMM‑based lipreading feature extraction methods: two top‑down approaches that fit inner and outer lip contour models and derive features from PCA of shape or shape‑appearance, and a bottom‑up nonlinear scale‑space pixel‑intensity method, evaluated on a multitalker isolated‑letter visual speech task. Integrating multimodal speech cues, including visual information, improves intelligibility, especially when the acoustic signal is degraded.
The multimodal nature of speech is often ignored in human-computer interaction, but lip deformations and other body motions, such as those of the head, convey additional information. We integrate speech cues from many sources, and this improves intelligibility, especially when the acoustic signal is degraded. The paper shows how this additional, often complementary, visual speech information can be used for speech recognition. Three methods for parameterizing lip image sequences for recognition using hidden Markov models are compared. Two of these are top-down approaches that fit a model of the inner and outer lip contours and derive lipreading features from a principal component analysis of shape, or of shape and appearance, respectively. The third, bottom-up, method uses a nonlinear scale-space analysis to form features directly from the pixel intensity. All methods are compared on a multitalker visual speech recognition task of isolated letters.
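To make the shape-based, top-down idea concrete, the following is a minimal sketch of deriving per-frame features from a principal component analysis of lip contour coordinates. It is not the paper's implementation: the function names `fit_shape_pca` and `shape_features`, the landmark count, the number of retained modes, and the synthetic data are all assumptions chosen for illustration, and the contour fitting step itself is not shown.

```python
import numpy as np


def fit_shape_pca(shapes, n_components):
    """Fit a PCA shape model (hypothetical helper, not from the paper).

    shapes: array of shape (n_frames, 2 * n_points), each row holding the
    flattened (x, y) coordinates of the contour landmarks for one frame.
    """
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # SVD of the centered data: the rows of vt are the principal modes,
    # ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    modes = vt[:n_components]
    variances = (s[:n_components] ** 2) / (len(shapes) - 1)
    return mean, modes, variances


def shape_features(shapes, mean, modes):
    """Project contour shapes onto the retained modes, one vector per frame."""
    return (shapes - mean) @ modes.T


# Synthetic stand-in for tracked contours (assumption): 200 frames of
# 20 landmarks that vary along two latent modes plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
basis = rng.normal(size=(2, 40))
shapes = 5.0 + latent @ basis + 0.01 * rng.normal(size=(200, 40))

mean, modes, variances = fit_shape_pca(shapes, n_components=2)
features = shape_features(shapes, mean, modes)
print(features.shape)
```

Each row of `features` is a low-dimensional description of one frame, and a sequence of such rows is the kind of observation sequence a hidden Markov model recognizer could consume. A shape-and-appearance variant would additionally model pixel intensities, which this sketch omits.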