A Simple Yet Effective Baseline for 3d Human Pose Estimation

TLDR

Deep convolutional networks have driven state-of-the-art 3D human pose estimation, yet it remains unclear whether the remaining error arises from limited 2D pose (visual) understanding or from mapping 2D poses into 3D space. To separate these error sources, the authors build a system that predicts 3D joint positions from 2D joint locations, using a relatively simple deep feedforward network to lift the 2D coordinates to 3D. Lifting ground-truth 2D joints achieves a remarkably low error, outperforming the best reported result by about 30% on Human3.6M. When trained on the output of an off-the-shelf 2D detector, the same system attains state-of-the-art performance. This indicates that a large portion of the error in current 3D pose estimation comes from visual analysis, and it points to directions for future improvement.

Abstract

Following the success of deep convolutional networks, state-of-the-art methods for 3d human pose estimation have focused on deep end-to-end systems that predict 3d joint locations given raw image pixels. Despite their excellent performance, it is often not easy to understand whether their remaining error stems from a limited 2d pose (visual) understanding, or from a failure to map 2d poses into 3-dimensional positions. With the goal of understanding these sources of error, we set out to build a system that, given 2d joint locations, predicts 3d positions. Much to our surprise, we have found that, with current technology, "lifting" ground truth 2d joint locations to 3d space is a task that can be solved with a remarkably low error rate: a relatively simple deep feedforward network outperforms the best reported result by about 30% on Human3.6M, the largest publicly available 3d pose estimation benchmark. Furthermore, training our system on the output of an off-the-shelf state-of-the-art 2d detector (i.e., using images as input) yields state-of-the-art results; the comparison includes an array of systems that have been trained end-to-end specifically for this task. Our results indicate that a large portion of the error of modern deep 3d pose estimation systems stems from their visual analysis, and suggest directions to further advance the state of the art in 3d human pose estimation.
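
To make the "lifting" idea concrete, here is a minimal sketch of a feedforward network that maps 2d joint coordinates to 3d joint positions. It is not the authors' implementation: the abstract does not state the joint count, layer sizes, or training procedure, so `NUM_JOINTS`, `HIDDEN`, the two hidden layers, and all function names are assumptions chosen for illustration. The weights are random and untrained, so only the input and output shapes are meaningful.

```python
import random

NUM_JOINTS = 17   # assumption: the joint count is not stated in the abstract
HIDDEN = 64       # assumption: hidden width chosen only for illustration


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu=True):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def init_lifter(seed=0):
    """Build an untrained network with 2*J inputs and 3*J outputs."""
    rng = random.Random(seed)
    return [
        init_layer(2 * NUM_JOINTS, HIDDEN, rng),
        init_layer(HIDDEN, HIDDEN, rng),
        init_layer(HIDDEN, 3 * NUM_JOINTS, rng),
    ]


def lift(pose_2d, layers):
    """Map a list of (x, y) joints to a list of (x, y, z) joints."""
    x = [c for joint in pose_2d for c in joint]      # flatten to 2*J values
    for layer in layers[:-1]:
        x = dense(x, layer, relu=True)
    y = dense(x, layers[-1], relu=False)             # linear output layer
    return [tuple(y[i:i + 3]) for i in range(0, len(y), 3)]


layers = init_lifter()
pose_2d = [(0.01 * j, -0.02 * j) for j in range(NUM_JOINTS)]  # dummy 2d input
pose_3d = lift(pose_2d, layers)
```

In the setting the abstract describes, `pose_2d` would be either ground truth 2d joints or the output of an off-the-shelf 2d detector, and the network would be trained to regress the corresponding 3d positions. Keeping the input limited to 2d coordinates is what lets the lifting error be measured separately from the visual (2d detection) error.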
