Publication | Closed Access
Active Speakers in Context
Citations: 61
References: 34
Year: 2020
Venue: Unknown
Engineering, Machine Learning, Active Speaker Context, Structured Ensemble, Active Listening, Speech Recognition, Natural Language Processing, Data Science, Pattern Recognition, Audiovisual Information, Speaker Diarization, Audio Analysis, Health Sciences, Active Speakers, Speech Communication, Multi-speaker Speech Recognition, Speech Processing, Speech Perception, Speaker Recognition
Active speaker detection methods typically model audiovisual data from a single speaker, which works for single‑speaker scenarios but fails to accurately identify speaking individuals among multiple candidates. The study proposes the Active Speaker Context, a representation that captures relationships among multiple speakers over extended time periods. The model learns pairwise and temporal relations among speakers from a structured ensemble of audiovisual observations. Experiments demonstrate that the structured feature ensemble improves active speaker detection, with the Active Speaker Context achieving an mAP of 87.1% on AVA‑ActiveSpeaker and ablation studies confirming the benefit of long‑term multi‑speaker analysis.
Current methods for active speaker detection focus on modeling audiovisual information from a single speaker. This strategy can be adequate for single-speaker scenarios, but it prevents accurate detection when the task is to identify which of many candidate speakers are talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our new model learns pairwise and temporal relations from a structured ensemble of audiovisual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. We also find that the proposed Active Speaker Context improves the state of the art on the AVA-ActiveSpeaker dataset, achieving an mAP of 87.1%. Moreover, ablation studies verify that this result is a direct consequence of our long-term multi-speaker analysis.
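To make the idea of a "structured ensemble of audiovisual observations" more concrete, the following is a minimal sketch of how per-speaker features might be arranged around a reference speaker and paired with each context speaker over time. It is not the authors' implementation: the function names `build_context_ensemble` and `pairwise_relations`, the nested-list layout, the padding rule for scenes with too few speakers, and the use of plain concatenation as the pairing step are all assumptions made for illustration. The abstract does not specify how the pairwise and temporal relations are learned, so that part is left out.

```python
def build_context_ensemble(features, reference, num_context):
    """Arrange features so the reference speaker comes first at every time step.

    features[t][s] is the feature vector (a list of floats) for candidate
    speaker s at time step t. Returns a list of time steps, each holding
    1 + num_context vectors: the reference, then the context speakers.
    """
    num_speakers = len(features[0])
    others = [s for s in range(num_speakers) if s != reference]
    if not others:
        # Assumption: a lone speaker is padded with copies of itself.
        others = [reference]
    # Assumption: cycle through the other speakers to fill every context slot.
    picked = [others[i % len(others)] for i in range(num_context)]
    order = [reference] + picked
    return [[step[s] for s in order] for step in features]


def pairwise_relations(ensemble):
    """Pair the reference vector with each context vector at each time step.

    Pairing here is plain concatenation (an assumption); a learned model
    would consume these pairs across the whole time window.
    """
    return [[step[0] + ctx for ctx in step[1:]] for step in ensemble]


# Hypothetical toy input: 4 time steps, 3 candidate speakers, 2-d features.
toy_features = [[[float(10 * t + s)] * 2 for s in range(3)] for t in range(4)]
toy_ensemble = build_context_ensemble(toy_features, reference=1, num_context=2)
toy_pairs = pairwise_relations(toy_ensemble)
```

Keeping the reference speaker in a fixed slot means that whatever model consumes the ensemble always knows which candidate it is scoring, while the remaining slots supply the multi-speaker context over the chosen time window.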