An overview of automatic speaker diarization systems

TLDR

Audio diarization annotates an audio channel by attributing (possibly overlapping) temporal regions of signal energy to specific sources such as speakers, music, or background noise. It is used to aid speech recognition, to make audio archives easier to search and index, and to make automatic transcriptions richer and more readable. The paper reviews current approaches to speaker diarization and discusses their relative merits and limitations. It compares their performance within the speaker diarization task of the DARPA EARS Rich Transcription evaluations, and examines how the techniques are being introduced into real broadcast news systems and how well they port to other domains and tasks such as meetings and speaker verification.

Abstract

Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. The performance of the different techniques is compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification.
