Concepedia

Abstract

A few decades ago, machine lip-reading would have been deemed impossible. However, the rapid growth of machine learning in recent years has made it possible for a machine to understand human speech from visual input alone. Numerous studies indicate that only a small percentage of spoken English can be comprehended through visual data alone, i.e. lip-reading. Visual speech recognition experts can infer only about 3–4% of spoken words through lip-reading, even after viewing videos (without audio) multiple times. These experts also rely, to some extent, on other cues such as body language, facial expressions, habits, and context. The task is tedious and exhausting. The proposed visual speech recognition approach uses deep learning to perform word-level classification. A ResNet architecture with 3D convolution layers serves as the encoder, and Gated Recurrent Units (GRU) serve as the decoder. The whole video sequence is used as input. The results are satisfactory: the approach achieves 90% accuracy on the BBC data set and 88% on the custom video data set. The approach is currently limited to word-level recognition, but it can readily be extended to short phrases or sentences.
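
The encoder/decoder pipeline described in the abstract can be illustrated with a small shape trace. This is only a sketch under stated assumptions, not the authors' implementation: the abstract gives no hyperparameters, so the kernel sizes, strides, channel count, feature width, GRU hidden size, bidirectionality, clip size (29 frames of 88x88 grayscale), and the 500-word vocabulary are all hypothetical values chosen for illustration. The helper names `conv_out`, `Clip`, and `trace_shapes` are likewise invented here. The point it shows is that a 3D convolution with temporal stride 1 keeps one feature vector per frame, so the GRU sees the whole video sequence before a single word label is predicted.

```python
from dataclasses import dataclass


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length along one axis of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


@dataclass
class Clip:
    frames: int  # number of video frames (time axis)
    height: int  # mouth-region crop height in pixels
    width: int   # mouth-region crop width in pixels


def trace_shapes(clip: Clip, num_words: int, feat_dim: int = 512, hidden: int = 256):
    """Return (stage name, tensor shape) pairs for one grayscale clip.

    All layer hyperparameters below are assumptions, not values from the paper.
    """
    stages = [("input", (1, clip.frames, clip.height, clip.width))]

    # Assumed 3D conv front end: kernel 5x7x7, stride 1x2x2, padding 2x3x3.
    # Temporal stride 1 with padding 2 preserves the number of frames.
    t = conv_out(clip.frames, 5, 1, 2)
    h = conv_out(clip.height, 7, 2, 3)
    w = conv_out(clip.width, 7, 2, 3)
    stages.append(("conv3d", (64, t, h, w)))

    # Assumed spatial max pool: 3x3 window, stride 2, padding 1 (time untouched).
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    stages.append(("pool", (64, t, h, w)))

    # ResNet trunk with global average pooling: one feature vector per frame.
    stages.append(("resnet", (t, feat_dim)))

    # Assumed bidirectional GRU over the whole sequence: 2 * hidden per step.
    stages.append(("gru", (t, 2 * hidden)))

    # Pool over time, then a linear layer scores each word class.
    stages.append(("logits", (num_words,)))
    return stages


if __name__ == "__main__":
    for name, shape in trace_shapes(Clip(frames=29, height=88, width=88), num_words=500):
        print(f"{name:>7}: {shape}")
```

With the assumed 29-frame, 88x88 clip, the time axis stays at 29 steps through every stage up to the GRU, while the spatial size shrinks from 88 to 44 to 22 before the ResNet trunk collapses it into a per-frame feature vector.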
