Teacher-Student Training for Text-Independent Speaker Recognition

Abstract

This paper investigates text-independent (TI) speaker recognition using neural embedding extractors based on the time-delay neural network. Our primary focus is the teacher-student (TS) training framework for knowledge distillation in the TI speaker recognition task. We report results on both proprietary and public benchmarks, obtaining competitive results with models that are 88-93% smaller. In particular, in clean testing conditions, neural TI systems trained with the TS framework perform as well as or better than their i-vector based counterparts. Neural embeddings are less prone to short-segment issues and offer better performance, particularly in the high-recall setting. They can also provide additional insights about speakers, such as gender or how difficult a given speaker is to recognize.
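
As an illustration of what a TS distillation objective can look like, the following is a minimal sketch, not the method used in this paper. It assumes that both the teacher and the student produce logits over the same set of training speakers, and that the student is trained to match the teacher's temperature-softened posteriors via a KL divergence. The function names and the temperature value are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between softened speaker posteriors.

    Assumption: this loss form is illustrative only; the abstract does not
    specify which distillation objective the authors used.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (fixed target)
    log_q = log_softmax(student_logits, temperature)  # student (being trained)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits over three training speakers (made-up values).
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.5]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's posteriors exactly and grows as the two distributions diverge. In practice such a term is often combined with a standard classification loss on the true speaker labels.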
