Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances

Abstract

Currently, the most widely used approach for speaker verification is the deep\nspeaker embedding learning. In this approach, we obtain a speaker embedding\nvector by pooling single-scale features that are extracted from the last layer\nof a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes\nmulti-scale features from different layers of the feature extractor, has\nrecently been introduced and shows superior performance for variable-duration\nutterances. To increase the robustness dealing with utterances of arbitrary\nduration, this paper improves the MSA by using a feature pyramid module. The\nmodule enhances speaker-discriminative information of features from multiple\nlayers via a top-down pathway and lateral connections. We extract speaker\nembeddings using the enhanced features that contain rich speaker information\nwith different time scales. Experiments on the VoxCeleb dataset show that the\nproposed module improves previous MSA methods with a smaller number of\nparameters. It also achieves better performance than state-of-the-art\napproaches for both short and long utterances.\n

References

Page 1

	Year	Citations

Page 1