Publication | Open Access
Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Citations: 39
References: 26
Year: 2020
Unknown Venue
Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, we obtain a speaker embedding vector by pooling single-scale features extracted from the last layer of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor, has recently been introduced and shows superior performance for variable-duration utterances. To increase robustness when dealing with utterances of arbitrary duration, this paper improves MSA by using a feature pyramid module. The module enhances the speaker-discriminative information of features from multiple layers via a top-down pathway and lateral connections. We extract speaker embeddings from the enhanced features, which contain rich speaker information at different time scales. Experiments on the VoxCeleb dataset show that the proposed module improves on previous MSA methods while using fewer parameters. It also achieves better performance than state-of-the-art approaches for both short and long utterances.
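The top-down pathway and lateral connections described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes 1-D feature maps (channels by time) instead of the 2-D maps a real speaker feature extractor would produce, random weights in place of trained ones, nearest-neighbour upsampling, and plain average pooling over time. All function names, channel counts, and sequence lengths are hypothetical and chosen only for illustration.

```python
import random

def conv1x1(feat, weight):
    """Lateral connection: mix channels at every time step (a 1x1 convolution).
    feat is [C_in][T], weight is [C_out][C_in], result is [C_out][T]."""
    steps = len(feat[0])
    return [[sum(w * feat[c][t] for c, w in enumerate(row)) for t in range(steps)]
            for row in weight]

def upsample(feat, length):
    """Nearest-neighbour upsampling along the time axis to `length` steps."""
    steps = len(feat[0])
    return [[ch[t * steps // length] for t in range(length)] for ch in feat]

def feature_pyramid(features, laterals):
    """features: maps ordered shallow (long) to deep (short).
    Returns enhanced maps in the same order, all with the same channel count."""
    enhanced = [conv1x1(features[-1], laterals[-1])]  # start from the deepest layer
    for feat, weight in zip(reversed(features[:-1]), reversed(laterals[:-1])):
        lateral = conv1x1(feat, weight)
        top_down = upsample(enhanced[0], len(lateral[0]))
        merged = [[a + b for a, b in zip(lr, tr)] for lr, tr in zip(lateral, top_down)]
        enhanced.insert(0, merged)
    return enhanced

def embed(enhanced):
    """Average-pool each enhanced map over time, then concatenate the results."""
    return [sum(ch) / len(ch) for fmap in enhanced for ch in fmap]

# Hypothetical setup: three layers with 4, 8, and 16 channels, projected to D = 6.
rng = random.Random(0)

def rand_map(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

channels, D = [4, 8, 16], 6
laterals = [rand_map(D, c) for c in channels]  # untrained stand-in weights

short_utt = [rand_map(c, t) for c, t in zip(channels, [16, 8, 4])]
long_utt = [rand_map(c, t) for c, t in zip(channels, [64, 32, 16])]

emb_short = embed(feature_pyramid(short_utt, laterals))
emb_long = embed(feature_pyramid(long_utt, laterals))
print(len(emb_short), len(emb_long))  # prints "18 18"
```

Because each enhanced map is pooled over time before concatenation, the embedding has the same size for short and long inputs, which is the property that makes the approach applicable to variable-duration utterances.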