Contrast-reconstruction Representation Learning for Self-supervised Skeleton-based Action Recognition

Abstract

Skeleton-based action recognition is widely used in areas such as surveillance and human-machine interaction. Existing models are mainly trained in a supervised manner and therefore depend heavily on large-scale labeled data, which can be infeasible to obtain when labels are prohibitively expensive. In this paper, we propose a novel Contrast-Reconstruction Representation Learning network (CRRL) that simultaneously captures postures and motion dynamics for unsupervised skeleton-based action recognition. It consists of three main parts: a Sequence Reconstructor, a Contrastive Motion Learner, and an Information Fuser. The Sequence Reconstructor learns a representation from the skeleton coordinate sequence via reconstruction; as a result, the learned representation tends to focus on trivial postural coordinates and to under-represent motion. To enhance the learning of motions, the Contrastive Motion Learner performs contrastive learning between the representation learned from the coordinate sequence and the representation learned from an additional velocity sequence. Finally, in the Information Fuser, we explore varied strategies for combining the Sequence Reconstructor and the Contrastive Motion Learner, and propose to capture postures and motions simultaneously via a knowledge-distillation based fusion strategy that transfers the motion learning from the Contrastive Motion Learner to the Sequence Reconstructor. Experimental results on several benchmarks, i.e., NTU RGB+D 60, NTU RGB+D 120, CMU mocap, and NW-UCLA, demonstrate the promise of the proposed CRRL method, which outperforms state-of-the-art approaches by a large margin.
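To make the idea of contrasting coordinate and velocity representations concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes that the velocity sequence is the frame-to-frame difference of joint coordinates and that the contrastive objective is an InfoNCE-style loss; the abstract does not specify either choice. The function names `velocity` and `info_nce`, the `temperature` parameter, and the array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def velocity(seq):
    """Frame-to-frame difference of a (T, J, C) coordinate sequence.

    Assumption: the first frame has zero velocity so the output
    keeps the same shape as the input.
    """
    vel = np.zeros_like(seq)
    vel[1:] = seq[1:] - seq[:-1]
    return vel


def info_nce(z_pos, z_vel, temperature=0.1):
    """InfoNCE-style loss between two (N, D) batches of embeddings.

    Row i of z_pos and row i of z_vel are treated as the positive pair;
    every other row of z_vel serves as a negative for row i of z_pos.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    a = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    b = z_vel / np.linalg.norm(z_vel, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-probability of the positive (diagonal) pairs.
    return float(-np.mean(np.diag(log_prob)))


# Hypothetical usage: 8 frames, 25 joints, 3-D coordinates.
rng = np.random.default_rng(0)
seq = rng.normal(size=(8, 25, 3))
vel = velocity(seq)
```

In a full model, `z_pos` and `z_vel` would be the outputs of encoders applied to the coordinate and velocity sequences respectively; here they are left as plain arrays so the loss can be inspected in isolation.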
