Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition

TLDR

Human activity recognition has traditionally relied on engineered features, but recent work suggests that deep convolutional networks can extract features automatically from raw sensor data. Because activities consist of complex sequences of motor movements, capturing their temporal dynamics is also essential for accurate recognition. The study proposes a generic deep framework that combines convolutional and LSTM recurrent units for multimodal wearable activity recognition; it automates feature extraction, fuses sensors, and models temporal dynamics without expert feature design. The framework is evaluated on two datasets, one of them from a public activity recognition challenge, and the influence of key architectural hyperparameters is analysed to guide optimisation. On the challenge dataset, the framework outperforms competing non-recurrent deep networks by 4% on average and some previously reported results by up to 9%. It can be applied to homogeneous sensor modalities and can also fuse multimodal sensors to improve performance.

Abstract

Human activity recognition (HAR) tasks have traditionally been solved using engineered features obtained by heuristic processes. Current research suggests that deep convolutional neural networks are suited to automate feature extraction from raw sensor inputs. However, human activities are made of complex sequences of motor movements, and capturing these temporal dynamics is fundamental for successful HAR. Based on the recent success of recurrent neural networks for time series domains, we propose a generic deep framework for activity recognition based on convolutional and LSTM recurrent units, which: (i) is suitable for multimodal wearable sensors; (ii) can perform sensor fusion naturally; (iii) does not require expert knowledge in designing features; and (iv) explicitly models the temporal dynamics of feature activations. We evaluate our framework on two datasets, one of which has been used in a public activity recognition challenge. Our results show that our framework outperforms competing deep non-recurrent networks on the challenge dataset by 4% on average, and outperforms some of the previously reported results by up to 9%. They also show that the framework can be applied to homogeneous sensor modalities, but can also fuse multimodal sensors to improve performance. We characterise the influence of key architectural hyperparameters on performance to provide insights about their optimisation.
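
To make the combination of convolutional and LSTM recurrent units more concrete, the following is a minimal PyTorch sketch of one way such a network could be wired together. It is not the authors' implementation: the class name `ConvLSTMSketch`, the layer counts, filter sizes, unit counts, and the input window shape are all illustrative assumptions that do not come from the abstract. The sketch only illustrates the four listed properties: convolutions over raw sensor windows stand in for hand-designed features, the LSTM input concatenates feature maps from all sensor channels (one possible form of fusion), and the recurrent layers model the temporal dynamics of the feature activations.

```python
import torch
import torch.nn as nn


class ConvLSTMSketch(nn.Module):
    """Illustrative conv + LSTM classifier for windows of wearable sensor data.

    All hyperparameter defaults are placeholders chosen for this sketch,
    not values reported by the paper.
    """

    def __init__(self, n_sensor_channels, n_classes, n_filters=16,
                 kernel_len=5, n_conv_layers=2, lstm_units=32):
        super().__init__()
        layers = []
        in_maps = 1
        for _ in range(n_conv_layers):
            # The kernel spans the time axis only (kernel_len x 1), so every
            # sensor channel is filtered along time with shared weights.
            layers.append(nn.Conv2d(in_maps, n_filters, kernel_size=(kernel_len, 1)))
            layers.append(nn.ReLU())
            in_maps = n_filters
        self.conv = nn.Sequential(*layers)
        # At each time step the LSTM receives the feature maps of all sensor
        # channels at once, which is where the modalities are fused.
        self.lstm = nn.LSTM(input_size=n_filters * n_sensor_channels,
                            hidden_size=lstm_units, num_layers=2,
                            batch_first=True)
        self.classifier = nn.Linear(lstm_units, n_classes)

    def forward(self, x):
        # x: (batch, time, sensor_channels), raw sensor readings
        x = x.unsqueeze(1)             # (batch, 1, time, channels)
        x = self.conv(x)               # (batch, filters, time', channels)
        x = x.permute(0, 2, 1, 3)      # (batch, time', filters, channels)
        x = x.flatten(start_dim=2)     # (batch, time', filters * channels)
        out, _ = self.lstm(x)          # (batch, time', lstm_units)
        # Class scores (logits) are read from the last time step.
        return self.classifier(out[:, -1, :])


# Hypothetical usage: 8 windows, 24 time steps, 6 sensor channels, 4 activities.
model = ConvLSTMSketch(n_sensor_channels=6, n_classes=4)
window = torch.randn(8, 24, 6)
logits = model(window)
print(logits.shape)  # torch.Size([8, 4])
```

Because the convolutions are unpadded, each layer shortens the time axis by `kernel_len - 1` steps, so the input window must be longer than `n_conv_layers * (kernel_len - 1)` samples.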
