3D Tracking of Human Hands in Interaction with Unknown Objects

Abstract

The analysis and the understanding of object manipulation scenarios based on computer vision techniques can be greatly facilitated if we can gain access to the full articulation of the manipulating hands and the 3D pose of the manipulated objects. Currently, there exist methods for tracking hands in interaction with objects whose 3D models are known [2]. There are also methods that can reconstruct 3D models of objects that are partially observable in each frame of a sequence [3]. However, no method can track hands in interaction with unknown objects, ie objects whose 3D model is not known a priori. In this paper we propose a novel approach that can track human hands in interaction with unknown objects. As illustrated in Fig.1, the input to the method is a sequence of RGBD frames showing the interaction of one or two hands with an unknown object. Starting with the raw depth map (left) we perform a pre-processing step and compute the scene point cloud. We employ an appropriately modified model based hand tracker [4] and temporal information to track the hand 3D positions and posture (middle bottom). In this process, a progressively built object model is also taken into account to cope with hand-object occlusions. We use the estimated fingertip positions of the hand to segment the manipulated object from the rest of the scene (middle top). The segmented object points are used to update the object position and orientation in the current frame and are integrated into the object 3D representation (right). More specifically, the work flow of the proposed approach consists of five main components linked together as shown in Fig. 2. At a first, preprocessing stage, the raw depth information from the sensor is prepared to enter the pipeline. A point cloud is computed along with the normals for each vertex. Then, the user’s hands are tracked in the scene. An articulated model for the left and right hands, with 26 degrees of freedom each, is fit to the pre-processed depth input. The current, possibly incomplete (or even empty, for the first frame) object model is incorporated to hand tracking to assist in handling hand/object occlusions. Using the computed 3D location of the user’s hands as well as the last position of the (possibly incomplete) object model, the region of the object is segmented in the input depth map. The hands are masked-out from the observation, by comparing it to the rendered hand models. Object tracking is achieved using a mutli-scale ICP [1]. The segmented object depth is used for a coarse to fine alignment with the (partially reconstructed) object model. Finally, the segmented and aligned depth data of the object with the current, partial 3D model are merged. The object’s 3D model is maintained in a voxel grid with a Truncated Signed Distance Function (TSDF) [3] representation. Experiment Proposed [2], GT model [2], Scanned model mean/median error mean/median error mean/median error Single hand, cat 0.42 / 0.39 0.47 / 0.43 0.45 / 0.43 Single hand, spray 0.65 / 0.63 0.70 / 0.53 0.63 / 0.47 Two hands, cat 0.38 / 0.34 0.33 / 0.31 0.44 / 0.39 Two hands, spray 0.59 / 0.44 0.51 / 0.38 0.62 / 0.41 Table 1: Hand tracking accuracy (in cm) measured on the synthetic datasets. The accuracy of the method is close to that of [2], although the latter assumes that the object model is known a priori.

References

Page 1

	Year	Citations

Page 1