Concepedia

Publication | Open Access

Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances

Citations: 39

References: 26

Year: 2020

Abstract

Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, we obtain a speaker embedding vector by pooling single-scale features that are extracted from the last layer of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor, has recently been introduced and shows superior performance for variable-duration utterances. To increase robustness when dealing with utterances of arbitrary duration, this paper improves MSA by using a feature pyramid module. The module enhances the speaker-discriminative information of features from multiple layers via a top-down pathway and lateral connections. We extract speaker embeddings using the enhanced features, which contain rich speaker information at different time scales. Experiments on the VoxCeleb dataset show that the proposed module improves on previous MSA methods while using fewer parameters. It also achieves better performance than state-of-the-art approaches for both short and long utterances.
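
The top-down pathway and lateral connections described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that each layer's features form a channels-by-frames map, that a lateral connection is a 1x1 convolution (per-frame channel mixing), that the top-down pathway uses nearest-neighbour upsampling along time followed by element-wise addition, and that aggregation is average pooling over time followed by concatenation. All function names, shapes, and weights are hypothetical, and refinements such as smoothing convolutions after the merge are left out.

```python
import random

def conv1x1(feat, weight):
    """Lateral connection: mix channels at every frame (a 1x1 convolution).
    feat is [C_in][T], weight is [C_out][C_in]; returns [C_out][T]."""
    steps = len(feat[0])
    return [[sum(w * feat[c][t] for c, w in enumerate(row)) for t in range(steps)]
            for row in weight]

def upsample_time(feat, target_len):
    """Nearest-neighbour upsampling along the time axis."""
    src_len = len(feat[0])
    return [[ch[t * src_len // target_len] for t in range(target_len)] for ch in feat]

def feature_pyramid(features, lateral_weights):
    """features: bottom-to-top list of [C_i][T_i] maps (T shrinks with depth).
    Returns enhanced maps in the same order, all with the shared channel count."""
    laterals = [conv1x1(f, w) for f, w in zip(features, lateral_weights)]
    enhanced = [laterals[-1]]  # the top level has nothing above it to merge
    for lat in reversed(laterals[:-1]):
        # Top-down pathway: stretch the coarser map, then add the lateral map.
        top = upsample_time(enhanced[0], len(lat[0]))
        merged = [[a + b for a, b in zip(lr, tr)] for lr, tr in zip(lat, top)]
        enhanced.insert(0, merged)
    return enhanced

def aggregate(enhanced):
    """Multi-scale aggregation: average-pool each map over time, then concatenate."""
    return [sum(ch) / len(ch) for feat in enhanced for ch in feat]

# Toy input (assumed shapes): three levels with (channels, frames) as below.
rng = random.Random(0)
shapes = [(4, 8), (8, 4), (16, 2)]
features = [[[rng.gauss(0, 1) for _ in range(t)] for _ in range(c)] for c, t in shapes]
shared = 6  # hypothetical shared channel count after the lateral connections
lateral_weights = [[[rng.gauss(0, 0.1) for _ in range(c)] for _ in range(shared)]
                   for c, _ in shapes]

enhanced = feature_pyramid(features, lateral_weights)
embedding = aggregate(enhanced)  # one vector built from all three time scales
```

Because pooling removes the time axis, the embedding length depends only on the number of levels and the shared channel count, not on utterance duration. In this sketch that is 3 levels times 6 channels.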
