UniPose: Unified Human Pose Estimation in Single Images and Videos

TLDR

The authors propose UniPose, a unified framework for human pose estimation that achieves state-of-the-art results on several metrics. UniPose is built on a Waterfall Atrous Spatial Pooling architecture and performs contextual segmentation and joint localization in a single stage. Its cascade of progressive filters provides multi-scale fields of view, and a UniPose-LSTM extension handles temporal pose estimation. Experiments on multiple datasets show that UniPose, with a ResNet backbone and the Waterfall module, achieves state-of-the-art performance for single-person pose detection in both images and videos.

Abstract

We propose UniPose, a unified framework for human pose estimation, based on our "Waterfall" Atrous Spatial Pooling architecture, that achieves state-of-the-art results on several pose estimation metrics. UniPose incorporates contextual segmentation and joint localization to estimate the human pose in a single stage, with high accuracy, without relying on statistical postprocessing methods. The Waterfall module in UniPose leverages the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields of view comparable to spatial pyramid configurations. Additionally, our method is extended to UniPose-LSTM for multi-frame processing and achieves state-of-the-art results for temporal pose estimation in video. Our results on multiple datasets demonstrate that UniPose, with a ResNet backbone and Waterfall module, is a robust and efficient architecture for pose estimation, obtaining state-of-the-art results in single-person pose detection for both single images and videos.
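The claim that a cascade keeps multi-scale fields of view can be checked with simple receptive-field arithmetic. The sketch below is a toy 1-D illustration under stated assumptions, not the UniPose implementation: it assumes 3-tap all-ones kernels, zero "same" padding, and small dilation rates (1, 2, 3) so the arrays stay short. All helper names (`atrous_conv1d`, `waterfall`, `parallel`, `cascade_rf`, `parallel_rf`) are hypothetical. Each waterfall branch filters the previous branch's output, so its receptive field grows cumulatively, while each parallel (spatial-pyramid-style) branch sees only its own dilation. The closed-form lines use rates 6, 12, 18 and 24, which are typical of atrous spatial pyramid pooling; treat them as an assumption here.

```python
# Toy 1-D comparison of cascaded ("waterfall") vs. parallel atrous filtering.
# Assumptions (illustrative only): 3-tap all-ones kernels, zero "same" padding,
# small dilation rates. Pooling branches, channel projections and the decoder
# of a full pose network are omitted.


def atrous_conv1d(x, kernel, dilation):
    """Dilated 1-D cross-correlation with zero 'same' padding."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # tap positions spread by the dilation
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def waterfall(x, kernel, dilations):
    """Cascade: each branch filters the previous branch's output.
    Every intermediate output is kept, as it would be for concatenation."""
    branches, current = [], x
    for d in dilations:
        current = atrous_conv1d(current, kernel, d)
        branches.append(current)
    return branches


def parallel(x, kernel, dilations):
    """Spatial-pyramid style: every branch filters the same input."""
    return [atrous_conv1d(x, kernel, d) for d in dilations]


def support_width(signal):
    """Width of the nonzero region of an impulse response (= receptive field)."""
    idx = [i for i, v in enumerate(signal) if v != 0]
    return idx[-1] - idx[0] + 1


def cascade_rf(dilations, k=3):
    """Closed form for the cascade: each k-tap layer adds (k - 1) * d."""
    rf, out = 1, []
    for d in dilations:
        rf += (k - 1) * d
        out.append(rf)
    return out


def parallel_rf(dilations, k=3):
    """Closed form for independent branches: 1 + (k - 1) * d each."""
    return [1 + (k - 1) * d for d in dilations]


if __name__ == "__main__":
    impulse = [0.0] * 41
    impulse[20] = 1.0  # unit impulse in the middle of the signal
    kernel = [1.0, 1.0, 1.0]
    rates = (1, 2, 3)

    print([support_width(b) for b in waterfall(impulse, kernel, rates)])  # [3, 7, 13]
    print([support_width(b) for b in parallel(impulse, kernel, rates)])   # [3, 5, 7]
    print(cascade_rf((6, 12, 18, 24)))   # [13, 37, 73, 121]
    print(parallel_rf((6, 12, 18, 24)))  # [13, 25, 37, 49]
```

With the same number of 3-tap atrous layers, the cascaded branches in this toy setting span 13, 37, 73 and 121 feature-map positions, whereas the parallel branches span 13 to 49. Keeping every intermediate output is what preserves the smaller scales alongside the larger ones.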
