Publication | Closed Access
A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video
Citations: 108 | References: 28 | Year: 2021
Unknown Venue
Topics: Machine Vision, Image Analysis, Machine Learning, Kinesiology, Engineering, 3D Pose Estimation, Human Pose Estimation, Scene Modeling, Video Understanding, Robot Learning, Depth Ambiguity, Deep Learning, Spatio-temporal Information, 3D Object Recognition, Video Interpretation, Computer Vision
Spatio-temporal information is key to resolving occlusion and depth ambiguity in 3D human pose estimation. Previous methods have focused either on temporal contexts or on local-to-global architectures that embed fixed-length spatio-temporal information. To date, no effective approach has been proposed that both flexibly captures spatio-temporal sequences of varying length and achieves real-time 3D human pose estimation. In this work, we improve the learning of three kinematic constraints of the human skeleton (posture, local kinematic connections, and symmetry) by modeling local and global spatial information via attention mechanisms. To support both single- and multi-frame estimation, we employ a dilated temporal model that processes skeleton sequences of varying length. Importantly, we carefully design the interleaving of spatial semantics with temporal dependencies to achieve a synergistic effect. To this end, we propose a simple yet effective graph attention spatio-temporal convolutional network (GAST-Net), which comprises interleaved temporal convolutional and graph attention blocks. Experiments on two challenging benchmark datasets (Human3.6M and HumanEva-I) and on YouTube videos demonstrate that our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to upper-body (half-body) estimation, and achieves competitive performance on 2D-to-3D video pose estimation. Code, video, and supplementary information are available at: http://www.juanrojas.net/gast/
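The interleaving of dilated temporal convolutions with graph attention over skeleton joints can be illustrated with a minimal NumPy sketch. This is not the authors' GAST-Net code: it assumes a standard single-head GAT-style attention layer, a hypothetical 5-joint toy skeleton, hypothetical layer sizes, dilations (1, 3, 9) and random weights, and it omits residual connections, normalization and the global (posture) attention branch. It shows how each temporal block shrinks the frame axis while each attention block mixes features only between joints that are connected in the skeleton.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def dilated_temporal_conv(x, w, dilation):
    """Valid 1D convolution over the frame axis (hypothetical helper).
    x: (T, J, C_in); w: (K, C_in, C_out) -> (T - (K-1)*dilation, J, C_out)."""
    k = w.shape[0]
    t_out = x.shape[0] - (k - 1) * dilation
    out = np.zeros((t_out, x.shape[1], w.shape[2]))
    for i in range(k):
        # Tap i reads the window shifted by i * dilation frames.
        out += x[i * dilation : i * dilation + t_out] @ w[i]
    return out


def graph_attention(h, adj, w, a):
    """GAT-style attention over joints, applied per frame (assumed variant).
    h: (T, J, C_in); adj: (J, J) 0/1 mask with self-loops;
    w: (C_in, C_out); a: (2 * C_out,)."""
    z = h @ w                                   # (T, J, C_out)
    c = z.shape[-1]
    src = z @ a[:c]                             # (T, J) score part from joint i
    dst = z @ a[c:]                             # (T, J) score part from neighbor j
    e = leaky_relu(src[:, :, None] + dst[:, None, :])   # (T, J, J)
    e = np.where(adj[None] > 0, e, -np.inf)     # attend only to skeleton neighbors
    e = e - e.max(axis=-1, keepdims=True)       # numerically stable softmax
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha @ z                            # (T, J, C_out)


def receptive_field(kernel, dilations):
    # Each valid dilated conv consumes (kernel - 1) * d extra frames.
    return 1 + (kernel - 1) * sum(dilations)


rng = np.random.default_rng(0)
T, J, C, HIDDEN = 27, 5, 2, 16                  # 27 frames of 2D keypoints (x, y)
edges = [(0, 1), (1, 2), (0, 3), (3, 4)]        # hypothetical toy skeleton
adj = np.eye(J)
for i, j in edges:
    adj[i, j] = adj[j, i] = 1

x = rng.standard_normal((T, J, C))
h = x @ rng.standard_normal((C, HIDDEN))        # lift 2D coords to features
for d in (1, 3, 9):
    # Temporal block: mixes frames; spatial block: mixes connected joints.
    w_t = rng.standard_normal((3, HIDDEN, HIDDEN)) * 0.1
    h = np.maximum(dilated_temporal_conv(h, w_t, d), 0.0)
    w_s = rng.standard_normal((HIDDEN, HIDDEN)) * 0.1
    a_s = rng.standard_normal(2 * HIDDEN) * 0.1
    h = graph_attention(h, adj, w_s, a_s)
pose3d = h @ rng.standard_normal((HIDDEN, 3))   # regress (x, y, z) per joint
print(pose3d.shape)                             # (1, 5, 3)
```

With kernel size 3 and dilations 1, 3 and 9, the receptive field is 1 + 2(1 + 3 + 9) = 27 frames, so the 27-frame input window collapses to a single output frame of 3D joint positions. Longer or shorter windows follow from choosing other dilation schedules.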