Publication | Open Access
From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization
Citations: 33
References: 0
Year: 2022
End-to-end neural diarization (EEND) is one of the most prominent research topics in speaker diarization. EEND presents an attractive alternative to standard cascaded diarization systems, since a single system is trained at once to deal with the whole diarization problem. Several EEND variants and approaches have been proposed; however, all of these models require large amounts of annotated training data, and such data are scarce. EEND works have therefore mostly used simulated mixtures for training. However, simulated mixtures do not resemble real conversations in many respects. In this work, we present an alternative method for creating synthetic conversations that resemble real ones by using statistics about distributions of pauses and overlaps estimated on genuine conversations. Furthermore, we analyze the effect of the source of the statistics, different augmentations, and amounts of data. We demonstrate that our approach performs substantially better than the original simulated-mixtures approach, while reducing the dependence on the fine-tuning stage. Experiments are carried out on 2-speaker telephone conversations of Callhome and DIHARD 3. Together with this publication, we release our implementations of EEND and the method for creating simulated conversations.
Index Terms: speaker diarization, end-to-end neural diarization, simulated conversations
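To make the idea of pause and overlap statistics concrete, the following is a minimal sketch of how utterances might be laid out on a timeline using gaps drawn from values observed in real conversations. It is not the authors' released implementation: the function name `simulate_conversation`, the use of a single list of signed gaps (positive for a pause, negative for an overlap), and the clamping rule are all assumptions made for illustration.

```python
import random


def simulate_conversation(utterances, gap_samples, rng=None):
    """Lay out utterances on a timeline (hypothetical helper, for illustration).

    utterances:  list of (speaker_id, duration_seconds) in speaking order.
    gap_samples: gaps in seconds assumed to be measured on real conversations;
                 a positive value is a pause, a negative value is an overlap.
    Returns a list of (speaker_id, start_seconds, end_seconds).
    """
    rng = rng or random.Random()
    segments = []
    prev_start, prev_end = 0.0, 0.0
    for index, (speaker, duration) in enumerate(utterances):
        if index == 0:
            start = 0.0
        else:
            # Draw a gap from the empirical values instead of a fixed rule.
            gap = rng.choice(gap_samples)
            # Assumption: an utterance never starts before the previous one did.
            start = max(prev_start, prev_end + gap)
        end = start + duration
        segments.append((speaker, start, end))
        prev_start, prev_end = start, max(prev_end, end)
    return segments


if __name__ == "__main__":
    turns = [("A", 2.0), ("B", 1.5), ("A", 1.0)]
    gaps = [0.5, -0.5, 0.2]  # made-up values, not statistics from the paper
    for segment in simulate_conversation(turns, gaps, random.Random(0)):
        print(segment)
```

Sampling from observed gaps, rather than placing each speaker's speech independently, is what lets the synthetic timeline contain both silences and overlapped speech in proportions similar to the conversations the gaps were measured on.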