Publication | Open Access
Panoptic Studio: A Massively Multiview System for Social Interaction Capture
Year: 2016
Keywords: Engineering, Human Pose Estimation, 3D Pose Estimation, Social Interactions, Communication, Dense 3D, Panoptic Studio, Motion Capture, Virtual Reality, Multimodal Interaction, Multimodal Human Computer Interface, Dance, Machine Vision, Interaction Technique, Social Interaction, Structure From Motion, Computer Vision, Social Computing, Eye Tracking, Extended Reality, Human-computer Interaction, Arts, Scene Modeling, Interactive Computing
Capturing social interactions is difficult due to frequent occlusion, subtle motion over large spaces, vast appearance variation, and marker attachment bias. The study introduces a modular system to capture the 3D motion of groups engaged in social interactions without markers. The Panoptic Studio uses 480 synchronized cameras to fuse weak perceptual cues across many views, generating skeletal proposals and refining them temporally to produce labeled, time‑varying 3D anatomical landmarks for each individual. The system successfully reconstructs full‑body motion of more than five people in marker‑free social interactions, and experiments show that increasing the number of views improves reconstruction quality.
We present an approach to capture the 3D motion of a group of people engaged in a social interaction. The core challenges in capturing social interactions are: (1) occlusion is functional and frequent; (2) subtle motion needs to be measured over a space large enough to host a social group; (3) human appearance and configuration variation is immense; and (4) attaching markers to the body may prime the nature of interactions. The Panoptic Studio is a system organized around the thesis that social interactions should be measured through the integration of perceptual analyses over a large variety of viewpoints. We present a modularized system designed around this principle, consisting of integrated structural, hardware, and software innovations. The system takes, as input, 480 synchronized video streams of multiple people engaged in social activities, and produces, as output, the labeled time-varying 3D structure of anatomical landmarks on individuals in the space. Our algorithm is designed to fuse the "weak" perceptual processes across the large number of views by progressively generating skeletal proposals from low-level appearance cues, and we also present a framework for temporal refinement that associates body parts with a reconstructed dense 3D trajectory stream. Our system and method are the first to reconstruct the full-body motion of more than five people engaged in social interactions without using markers. We also empirically demonstrate the impact of the number of views in achieving this goal.
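As an illustrative sketch only (not the paper's actual algorithm), the core idea of fusing weak per-view cues can be pictured as confidence-weighted triangulation: each camera contributes a noisy 2D detection of an anatomical landmark, and the views are combined into a single 3D estimate. The function name, interface, and weighting scheme below are assumptions for the example; the paper's method additionally builds skeletal proposals and refines them over time.

```python
import numpy as np

def triangulate_landmark(projections, points_2d, weights):
    """Triangulate one 3D landmark from weighted 2D detections.

    projections: list of 3x4 camera projection matrices
    points_2d:   list of (x, y) detections, one per camera
    weights:     per-view detection confidences in [0, 1]

    Uses the weighted Direct Linear Transform (DLT): each view adds
    two linear constraints on the homogeneous 3D point, scaled by its
    confidence, and the solution is the right singular vector of the
    stacked system with the smallest singular value.
    """
    rows = []
    for P, (x, y), w in zip(projections, points_2d, weights):
        rows.append(w * (x * P[2] - P[0]))  # constraint from x-coordinate
        rows.append(w * (y * P[2] - P[1]))  # constraint from y-coordinate
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize to a 3D point
```

With many views, low-confidence detections (e.g. from occluded cameras) are down-weighted rather than discarded, which is one intuition behind why adding views improves reconstruction quality.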