Publication | Closed Access
Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks
Year: 2017 | Citations: 843 | References: 43
Source Separation, Engineering, Machine Learning, Multitalker Speech Separation, Speech Enhancement, Speech Recognition, Phonetics, Speaker Diarization, Robust Speech Recognition, Permutation Invariant Training, Health Sciences, Deep Learning, Deep Clustering, Distant Speech Recognition, Speech Communication, Multi-speaker Speech Recognition, Three-speaker Speech Mixtures, Speech Separation, Speech Processing, Speech Perception
The paper proposes utterance-level permutation invariant training (uPIT), a practically applicable, end-to-end, deep-learning solution for speaker-independent multitalker speech separation. uPIT extends PIT with an utterance-level cost function: RNNs are trained to minimize the utterance-level separation error, which aligns frames belonging to the same speaker to the same output stream and eliminates the need for permutation resolution at inference. uPIT enables RNNs to separate multitalker speech without prior knowledge of signal duration, speaker count, identity, or gender; it outperforms NMF and CASA baselines, compares favorably with deep clustering and deep attractor networks, generalizes to unseen speakers and languages, and a single model handles both two- and three-speaker mixtures.
In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker-independent multitalker speech separation. Specifically, uPIT extends the recently proposed permutation invariant training (PIT) technique with an utterance-level cost function, hence eliminating the need for solving an additional permutation problem during inference, which is otherwise required by frame-level PIT. We achieve this using recurrent neural networks (RNNs) that, during training, minimize the utterance-level separation error, hence forcing separated frames belonging to the same speaker to be aligned to the same output stream. In practice, this allows RNNs, trained with uPIT, to separate multitalker mixed speech without any prior knowledge of signal duration, number of speakers, speaker identity, or gender. We evaluated uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that uPIT outperforms techniques based on nonnegative matrix factorization and computational auditory scene analysis, and compares favorably with deep clustering and the deep attractor network. Furthermore, we found that models trained with uPIT generalize well to unseen speakers and languages. Finally, we found that a single model, trained with uPIT, can handle both two-speaker and three-speaker speech mixtures.
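To make the utterance-level criterion described in the abstract concrete, here is a minimal sketch of a uPIT-style loss in PyTorch. The tensor shapes, the function name `upit_mse_loss`, and the mean-squared-error cost are illustrative assumptions, not the authors' implementation; the defining property of uPIT is that a single output-to-speaker permutation is selected per utterance rather than per frame.

```python
import itertools
import torch

def upit_mse_loss(estimates, targets):
    """Utterance-level PIT loss over S separated streams (a sketch).

    estimates, targets: (S, T, F) tensors -- S output/reference streams,
    T time frames, F frequency bins. Unlike frame-level PIT, the
    speaker-to-stream permutation is chosen once for the whole
    utterance, so each speaker stays on one output stream.
    """
    n_streams = estimates.shape[0]
    per_perm_losses = []
    for perm in itertools.permutations(range(n_streams)):
        # MSE between permuted estimates and references, averaged
        # over all frames and bins of the utterance.
        mse = torch.mean((estimates[list(perm)] - targets) ** 2)
        per_perm_losses.append(mse)
    # Train against the best (minimum-error) utterance-level assignment.
    return torch.stack(per_perm_losses).min()

# Toy usage: 2 speakers, 100 frames, 257 frequency bins.
est = torch.randn(2, 100, 257, requires_grad=True)
ref = torch.randn(2, 100, 257)
loss = upit_mse_loss(est, ref)
loss.backward()
```

For the two- and three-speaker mixtures considered in the paper, enumerating all S! permutations is cheap, and the minimum over permutations remains differentiable, so gradients flow only through the best utterance-level assignment.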