Publication | Open Access
Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction
Citations: 183
References: 32
Year: 2020
Unknown Venue
Engineering, Machine Learning, Human Pose Estimation, 3D Pose Estimation, Biometrics, Human-object Interaction, Image Analysis, Data Science, Pattern Recognition, Computational Imaging, Robot Learning, Pose Estimation Accuracy, Machine Vision, Inverse Problems, Computer Science, Video Understanding, Deep Learning, Computer Vision, Hand-object Reconstruction, Sparse Representation, Photometric Consistency, Scene Understanding, 3D Reconstruction, Multi-view Geometry, Hand-object Manipulations, Scene Modeling
Modeling hand-object interactions is crucial for understanding human manipulation, yet pose estimation is difficult because of mutual occlusions, and collecting 3D ground-truth data is costly. The authors exploit photometric consistency across time when annotations are available for only a sparse subset of video frames. They train an end-to-end network that jointly reconstructs hands and objects in 3D from color images by inferring their poses, differentiably renders the optical flow between adjacent frames from those reconstructions, and uses that flow to warp one frame to another. A self-supervised photometric loss then enforces visual consistency between the nearby frames. The approach achieves state-of-the-art results on 3D hand-object reconstruction benchmarks and improves pose accuracy in low-data regimes by using information from neighboring frames.
Modeling hand-object manipulations is essential for understanding how humans interact with their environment. While of practical importance, estimating the pose of hands and objects during interactions is challenging due to the large mutual occlusions that occur during manipulation. Recent efforts have been directed towards fully-supervised methods that require large amounts of labeled training samples. Collecting 3D ground-truth data for hand-object interactions, however, is costly, tedious, and error-prone. To overcome this challenge, we present a method to leverage photometric consistency across time when annotations are available for only a sparse subset of frames in a video. Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses. Given our estimated reconstructions, we differentiably render the optical flow between pairs of adjacent images and use it within the network to warp one frame to another. We then apply a self-supervised photometric loss that relies on the visual consistency between nearby images. We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach allows us to improve the pose estimation accuracy by leveraging information from neighboring frames in low-data regimes.
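The warp-and-compare step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: in the paper the flow is rendered differentiably from the estimated hand and object meshes, whereas here the flow is simply passed in as an array. The function names (`bilinear_sample`, `warp_with_flow`, `photometric_loss`), the optional `mask` argument, the grayscale nested-list image format, and the choice of a mean absolute (L1) difference are all assumptions made for illustration.

```python
import math

def bilinear_sample(img, x, y):
    """Sample a grayscale image (list of rows) at real-valued (x, y).

    Coordinates outside the image are clamped to the border (an assumption).
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom

def warp_with_flow(target, flow):
    """Backward-warp `target` into the reference frame.

    flow[y][x] = (u, v) is the displacement of reference pixel (x, y)
    in the target frame, so the output at (x, y) is target at (x + u, y + v).
    """
    h, w = len(target), len(target[0])
    return [
        [bilinear_sample(target, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]

def photometric_loss(reference, target, flow, mask=None):
    """Mean absolute difference between `reference` and the warped `target`.

    `mask` (hypothetical) restricts the loss to selected pixels, for example
    those covered by the rendered hand and object.
    """
    warped = warp_with_flow(target, flow)
    total, count = 0.0, 0
    for y in range(len(reference)):
        for x in range(len(reference[0])):
            if mask is None or mask[y][x]:
                total += abs(reference[y][x] - warped[y][x])
                count += 1
    return total / count if count else 0.0
```

When the flow implied by the estimated poses matches the true motion, the warped frame agrees with the reference frame and the loss is small. A pose error produces a wrong flow and a larger loss, which is what gives unannotated frames a training signal.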