Concepedia

Publication | Closed Access

Permutation invariant training of deep models for speaker-independent multi-talker speech separation

Citations: 856
References: 20
Year: 2017

TLDR

The authors introduce permutation invariant training (PIT) to address speaker‑independent multi‑talker speech separation, the cocktail‑party problem. PIT directly minimizes separation error, unlike multi‑class regression or deep clustering, providing a novel training criterion. PIT effectively resolves the label‑permutation issue, outperforming NMF, CASA, and DPCL on WSJ0 and Danish tasks, generalizes to unseen speakers and languages, and is simple to implement and extend.

Abstract

We propose a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, commonly known as the cocktail-party problem. Different from the multi-class regression technique and the deep clustering (DPCL) technique, our novel approach minimizes the separation error directly. This strategy effectively solves the long-lasting label permutation problem, which has prevented progress on deep-learning-based techniques for speech separation. We evaluated PIT on the WSJ0 and Danish mixed-speech separation tasks and found that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL and generalizes well over unseen speakers and languages. Since PIT is simple to implement and can be easily integrated and combined with other advanced techniques, we believe improvements built upon PIT can eventually solve the cocktail-party problem.
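The core idea of the criterion can be sketched in a few lines: score every assignment of estimated sources to reference sources and train on the cheapest one, so the network never needs a fixed output-to-speaker labeling. The sketch below is illustrative only (the function name, array shapes, and use of plain MSE over NumPy arrays are assumptions, not the paper's exact implementation, which operates on spectral masks inside a deep network):

```python
from itertools import permutations

import numpy as np


def pit_mse_loss(estimates, targets):
    """Permutation invariant MSE sketch.

    estimates, targets: arrays of shape (num_sources, num_frames).
    Tries every assignment of estimated sources to reference sources
    and returns the lowest error together with the chosen permutation.
    """
    num_sources = targets.shape[0]
    best_loss, best_perm = None, None
    for perm in permutations(range(num_sources)):
        # Mean squared error under this source-to-target assignment.
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# If the network emits the two speakers in swapped order, a fixed
# labeling would report a large error, while PIT finds the matching
# assignment and reports zero error.
targets = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
estimates = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
loss, perm = pit_mse_loss(estimates, targets)
```

Exhaustive search over permutations is factorial in the number of sources, which is tolerable for the two- and three-talker mixtures considered here.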

