Active Speakers in Context

TLDR

Active speaker detection methods typically model audiovisual data from a single speaker, which works for single‑speaker scenarios but fails to accurately identify speaking individuals among multiple candidates. The study proposes the Active Speaker Context, a representation that captures relationships among multiple speakers over extended time periods. The model learns pairwise and temporal relations among speakers from a structured ensemble of audiovisual observations. Experiments demonstrate that the structured feature ensemble improves active speaker detection, with the Active Speaker Context achieving an mAP of 87.1% on AVA‑ActiveSpeaker and ablation studies confirming the benefit of long‑term multi‑speaker analysis.

Abstract

Current methods for active speaker detection focus on modeling audiovisual information from a single speaker. This strategy can be adequate for single-speaker scenarios, but it prevents accurate detection when the task is to identify which of many candidate speakers are talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our new model learns pairwise and temporal relations from a structured ensemble of audiovisual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. We also find that the proposed Active Speaker Context improves the state of the art on the AVA-ActiveSpeaker dataset, achieving an mAP of 87.1%. Moreover, ablation studies verify that this result is a direct consequence of our long-term multi-speaker analysis.
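
To make the idea of a structured multi-speaker ensemble with pairwise relations more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The function names (build_context, pairwise_refine), the cyclic padding rule, and the use of scaled dot-product affinity as the pairwise relation are all assumptions chosen for illustration, and the temporal modeling stage is omitted.

```python
import math


def build_context(features, reference, num_speakers):
    """Stack per-speaker feature sequences, reference speaker first.

    features: dict mapping speaker id -> list of T feature vectors.
    Assumption: if fewer than num_speakers candidates exist, the
    available speakers are repeated cyclically to fill the ensemble.
    """
    others = [s for s in sorted(features) if s != reference]
    order = [reference] + others
    return [features[order[i % len(order)]] for i in range(num_speakers)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pairwise_refine(context):
    """Mix speaker features at each time step, weighted by affinity.

    Hypothetical stand-in for a learned pairwise relation: each
    speaker's weight is the softmax of its scaled dot product with
    the reference speaker (row 0) at the same time step.
    """
    num_steps = len(context[0])
    dim = len(context[0][0])
    refined = []
    for t in range(num_steps):
        ref = context[0][t]
        scores = [dot(ref, spk[t]) / math.sqrt(dim) for spk in context]
        weights = softmax(scores)
        mixed = [
            sum(w * spk[t][d] for w, spk in zip(weights, context))
            for d in range(dim)
        ]
        refined.append(mixed)
    return refined


# Toy example: two speakers, two time steps, 2-D features.
features = {
    "a": [[1.0, 0.0], [0.5, 0.5]],
    "b": [[0.0, 1.0], [0.5, 0.5]],
}
context = build_context(features, reference="a", num_speakers=3)
refined = pairwise_refine(context)
```

In a full system, the refined sequence would feed a temporal model and a classifier for the reference speaker. Both are left out here to keep the sketch short.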
