Publication | Closed Access
Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition
Citations: 215
References: 15
Year: 2014
Unknown Venue
Deep Neural Networks, Speech Sciences, Machine Learning, Voice, Baum-Welch Statistics, Health Sciences, Engineering, Multi-speaker Speech Recognition, Linguistics, Speech Acoustics, Speaker Diarization, Speech Processing, Deep Learning, Triphone States, Acoustic Analysis, Speech Signal Analysis, Speaker Recognition, Speech Recognition
The study investigates using deep neural networks to extract Baum‑Welch statistics for i‑vector‑based, text‑independent speaker recognition. The authors replace the EM‑trained universal background model with a DNN‑modeled triphone‑state posterior, combining these assignments with 60‑dim MFCCs to compute first‑order Baum‑Welch statistics for training the i‑vector extractor. Although the DNN‑derived i‑vectors perform worse alone, they provide complementary speaker information, yielding a 16 % relative gain when fused with standard i‑vectors, and a different DNN configuration achieved baseline‑level performance on NIST 2012 C2 (female).
We examine the use of Deep Neural Networks (DNN) in extracting Baum-Welch statistics for i-vector-based text-independent speaker recognition. Instead of training the universal background model using the standard EM algorithm, the components are predefined and correspond to the set of triphone states, the posterior occupancy probabilities of which are modeled by a DNN. Those assignments are then combined with the standard 60-dim MFCC features to calculate first-order Baum-Welch statistics in order to train the i-vector extractor and extract i-vectors. The DNN-based assignments force the i-vectors to capture the idiosyncratic way in which each speaker pronounces each particular triphone state, which can enrich the standard short-term spectral representation of the standard i-vectors. After experimenting with Switchboard data and a baseline PLDA classifier, our results showed that although the proposed i-vectors yield inferior performance compared to the standard ones, they are capable of attaining 16% relative improvement when fused with them, meaning that they carry useful complementary information about the speaker’s identity. A further experiment with a different DNN configuration attained performance comparable to the baseline i-vectors on NIST 2012 (condition C2, female).
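The statistics described in the abstract can be illustrated with a minimal sketch. It assumes the per-frame posteriors over components are already available (in the paper they would come from the DNN over triphone states) and accumulates the zeroth-order statistic N_c = Σ_t γ_t(c) and the first-order statistic F_c = Σ_t γ_t(c) x_t. The function name `baum_welch_stats`, the pure-Python list representation, and the toy dimensions are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def baum_welch_stats(
    posteriors: List[List[float]],
    features: List[List[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Accumulate zeroth- and first-order Baum-Welch statistics.

    posteriors[t][c] is the occupancy probability of component c at frame t
    (hypothetically produced by a DNN over triphone states).
    features[t] is the acoustic feature vector of frame t (e.g. MFCCs).
    """
    if len(posteriors) != len(features):
        raise ValueError("posteriors and features must have the same number of frames")
    if not features:
        raise ValueError("at least one frame is required")

    num_components = len(posteriors[0])
    dim = len(features[0])

    zeroth = [0.0] * num_components                          # N_c
    first = [[0.0] * dim for _ in range(num_components)]     # F_c

    for gamma_t, x_t in zip(posteriors, features):
        for c, gamma in enumerate(gamma_t):
            if gamma == 0.0:
                continue  # frame contributes nothing to this component
            zeroth[c] += gamma
            row = first[c]
            for d, value in enumerate(x_t):
                row[d] += gamma * value

    return zeroth, first


# Toy example: 3 frames, 2 components, 2-dimensional features
# (the paper uses 60-dimensional MFCCs and triphone-state components).
toy_posteriors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
toy_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
N, F = baum_welch_stats(toy_posteriors, toy_features)
```

In this sketch the only change relative to a conventional GMM-based front end is the source of `posteriors`; the accumulation itself is the same, which is why the DNN assignments can be dropped into an existing i-vector pipeline.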