Improving Multiparty Interactions with a Robot Using Large Language Models

Abstract

Speaker diarization is a key component of systems that support multiparty interactions among co-located users, such as meeting facilitation robots. The goal is to identify who said what, often so that the robot can provide feedback, moderate participation, and personalize its responses. Current systems perform diarization using a combination of acoustic features (e.g., pitch differences) and visual features (e.g., gaze), but these approaches require additional sensors or add signal-processing overhead. Automatic speech recognition (ASR), by contrast, is already a necessary step in the diarization pipeline, so identifying speaker labels directly from the transcribed text can avoid these costs. With that motivation, we use large language models (LLMs) to identify speaker labels from transcribed text and observe an exact match of 77% and a word-level accuracy of 90%. We discuss our findings and the potential of LLMs as a diarization tool for future systems.
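To make the two reported measures concrete, the following is a minimal sketch of how an exact-match rate and a word-level accuracy could be computed for text-based speaker labeling. The abstract does not define either metric, so the definitions here are assumptions: exact match is taken as the fraction of conversations in which every word receives the correct speaker label, and word-level accuracy as the fraction of all words labeled correctly. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import List

# Assumption: each conversation is represented as one speaker label per
# transcribed word, e.g. ["A", "A", "B"] for a three-word exchange.
Labels = List[str]


def exact_match(pred: List[Labels], gold: List[Labels]) -> float:
    """Fraction of conversations whose predicted labels all match the reference."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must contain the same number of conversations")
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(pred, gold) if p == g)
    return hits / len(gold)


def word_level_accuracy(pred: List[Labels], gold: List[Labels]) -> float:
    """Fraction of words, pooled over all conversations, with the correct speaker label."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must contain the same number of conversations")
    total = 0
    correct = 0
    for p, g in zip(pred, gold):
        if len(p) != len(g):
            raise ValueError("each conversation needs one predicted label per word")
        total += len(g)
        correct += sum(1 for ps, gs in zip(p, g) if ps == gs)
    return correct / total if total else 0.0


# Toy data (hypothetical): two conversations, one mislabeled word in the first.
gold_labels = [["A", "A", "B", "B"], ["A", "B"]]
pred_labels = [["A", "A", "B", "A"], ["A", "B"]]

em = exact_match(pred_labels, gold_labels)
acc = word_level_accuracy(pred_labels, gold_labels)
print(f"exact match: {em:.2f}, word-level accuracy: {acc:.2f}")
```

Under these assumed definitions, a single mislabeled word makes a whole conversation count as a miss for exact match while barely lowering word-level accuracy, which is one way a gap like the one between 77% and 90% could arise.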
