Publication | Closed Access
Deep Learning-Based Fast Hand Gesture Recognition Using Representative Frames
Citations: 60 | References: 21 | Year: 2016 | Venue: Unknown
Topics: Convolutional Neural Network · Machine Vision · Image Analysis · Machine Learning · Engineering · Pattern Recognition · Gesture Recognition · Human Pose Estimation · Deconvolution Network · Video Understanding · Deconvolutional Neural Network · Deep Learning · Video Transformer · Video Interpretation · Computer Vision · Image Sequence Analysis · Intelligent Vehicles
In this paper, we propose a vision-based hand gesture recognition system for intelligent vehicles. Vision-based gesture recognition systems are employed in automotive user interfaces to increase driver comfort without compromising safety. In our algorithm, the long-term recurrent convolutional network (LRCN) is used to classify video sequences of hand gestures. In the standard LRCN-based action classifier, multiple frames sampled from the video sequence are given as input to the network to perform classification. However, the use of multiple frames increases the computational complexity and also reduces the classification accuracy of the classifier. We address these issues by extracting a small set of representative frames from the video sequence and inputting only those frames to the LRCN. To extract the representative frames, we propose novel tiled image patterns and tiled binary patterns within a semantic segmentation-based deep learning framework, the deconvolutional neural network. The tiled image patterns contain multiple non-overlapping blocks and represent the entire gesture video sequence within a single tiled image. These image patterns are generated from the video sequence and form the input to the deconvolution network. The tiled binary patterns also contain multiple non-overlapping blocks and encode the representative frames of the video sequence; they form the output of the deconvolution network. The training binary patterns are generated from the training video sequences using a dictionary learning and sparse modeling framework. We validate our proposed algorithm on the public Cambridge gesture recognition dataset. A comparative analysis is performed with baseline algorithms, and an improved classification accuracy is observed. We also perform a detailed parametric analysis of the proposed algorithm.
We report a gesture classification accuracy of 91% and a near real-time runtime of 110 ms per video sequence.
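The paper does not include code, but the tiled-pattern idea can be made concrete. A minimal sketch, assuming grayscale frames, a 4×4 block grid, and illustrative function names and parameters (none of which come from the paper): the tiled image packs uniformly sampled frames into non-overlapping blocks of a single image, and the tiled binary pattern marks, block by block, which sampled frames are representative.

```python
import numpy as np

def make_tiled_image(frames, grid=(4, 4)):
    """Pack grid[0]*grid[1] uniformly sampled grayscale frames into one
    image of non-overlapping blocks (each block is one full frame)."""
    rows, cols = grid
    n = rows * cols
    # Uniformly sample n frame indices across the whole sequence.
    idx = np.linspace(0, len(frames) - 1, n).round().astype(int)
    h, w = frames[0].shape[:2]
    tiled = np.zeros((rows * h, cols * w), dtype=frames[0].dtype)
    for k, i in enumerate(idx):
        r, c = divmod(k, cols)
        tiled[r * h:(r + 1) * h, c * w:(c + 1) * w] = frames[i]
    return tiled

def make_tiled_binary_pattern(selected, grid=(4, 4), block=8):
    """Binary pattern: block (r, c) is all ones iff the k-th sampled
    frame is marked representative in `selected`."""
    rows, cols = grid
    pat = np.zeros((rows * block, cols * block), dtype=np.uint8)
    for k in range(rows * cols):
        if selected[k]:
            r, c = divmod(k, cols)
            pat[r * block:(r + 1) * block,
                c * block:(c + 1) * block] = 1
    return pat
```

In the paper's pipeline, the deconvolution network would learn a mapping from the tiled image to the tiled binary pattern, so that at test time the predicted pattern indicates which frames to feed to the LRCN classifier.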