VIBE: Video Inference for Body Pose and Shape Estimation

TLDR

Human motion is essential for behavior understanding, yet current video‑based 3D pose and shape methods produce inaccurate, unnatural sequences because they lack ground‑truth 3D motion data for training. VIBE is introduced to overcome this limitation by leveraging the AMASS motion capture dataset and unpaired in‑the‑wild 2D keypoint annotations. It employs an adversarial learning framework that discriminates real from generated motions using AMASS, coupled with a novel self‑attention temporal network trained at the sequence level to produce kinematically plausible motion without in‑the‑wild 3D labels. Experiments on challenging 3D pose datasets show that VIBE achieves state‑of‑the‑art performance, and its code and pretrained models are publicly available.

Abstract

Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose "Video Inference for Body Pose and Shape Estimation" (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a novel temporal network architecture with a self-attention mechanism and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at https://github.com/mkocabas/VIBE.
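The sequence-level discriminator described above relies on a self-attention mechanism to pool per-frame features into a single representation of the whole motion before judging it real or generated. The following is a minimal NumPy sketch of that attention-pooling step only; it is illustrative and not the authors' implementation (the learned parameters `w`, `b` and the feature shapes are assumptions, and in the actual model the frame features would come from a recurrent temporal encoder):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attention_pool(frame_feats, w, b):
    """Pool a sequence of per-frame features into one vector.

    frame_feats: (T, D) features, one row per video frame
    w, b: assumed learned projection producing one score per frame
    Returns the (D,) pooled feature and the (T,) attention weights.
    """
    scores = frame_feats @ w + b        # (T,) one scalar score per frame
    weights = softmax(scores)           # (T,) attention weights, sum to 1
    pooled = weights @ frame_feats      # (D,) attention-weighted average
    return pooled, weights

# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
T, D = 16, 8                            # 16 frames, 8-dim features (assumed)
feats = rng.normal(size=(T, D))
w, b = rng.normal(size=D), 0.0
pooled, attn = self_attention_pool(feats, w, b)
```

The pooled vector would then be scored by a small classifier; because the weights are learned, the discriminator can emphasize the frames most indicative of implausible motion rather than averaging all frames equally.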
