On Training Targets for Supervised Speech Separation

TLDR

Speech separation formulated as a supervised learning problem has shown promise. The ideal binary mask (IBM) is the traditional training target because of its simplicity and large intelligibility gains, but the framework is not limited to binary targets. This study trains a deep neural network to map noisy features to a time-frequency representation of the target and compares several training targets: the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. The two ratio masks, the IRM and the FFT-MASK, outperform the other targets on objective intelligibility and quality metrics. Masking-based targets are, in general, significantly better than spectral-envelope-based targets, and supervised speech separation shows clear performance advantages over recent non-negative matrix factorization and speech enhancement methods.

Abstract

Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results by using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking-based targets, in general, are significantly better than spectral-envelope-based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.
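
To make the difference between the binary and ratio targets concrete, the following is a minimal sketch of how three of the masks can be computed per time-frequency unit. The abstract does not state the formulas, so the definitions here are the ones commonly used in the masking literature and should be read as assumptions: the IBM thresholds the local SNR at a local criterion, the IRM is a compressed ratio of speech energy to speech-plus-noise energy, and the FFT-MASK is the ratio of clean to noisy spectral magnitude, clipped to a bounded range. The function names and the default values of `lc_db`, `beta`, and `max_value` are illustrative, not taken from the source.

```python
import math


def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Return 1 where the local SNR (dB) exceeds lc_db, else 0.

    Inputs are 2-D lists (time x frequency) of non-negative energies.
    lc_db is an assumed local criterion, not a value from the source.
    """
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0.0:
                row.append(0)          # no speech energy: noise-dominated
            elif n <= 0.0:
                row.append(1)          # no noise energy: speech-dominated
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


def ideal_ratio_mask(speech_energy, noise_energy, beta=0.5):
    """Return (S / (S + N)) ** beta per unit, a soft mask in [0, 1].

    beta is an assumed compression exponent.
    """
    return [
        [(s / (s + n)) ** beta if (s + n) > 0.0 else 0.0
         for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_energy, noise_energy)
    ]


def spectral_magnitude_mask(speech_mag, mixture_mag, max_value=10.0):
    """Return |S| / |Y| per unit, clipped to [0, max_value].

    Unlike the IRM, this ratio is not bounded by 1, so an assumed
    upper bound keeps the target in a range that is easy to train on.
    """
    return [
        [min(s / y, max_value) if y > 0.0 else 0.0
         for s, y in zip(s_row, y_row)]
        for s_row, y_row in zip(speech_mag, mixture_mag)
    ]


if __name__ == "__main__":
    speech = [[4.0, 1.0], [0.0, 9.0]]
    noise = [[1.0, 4.0], [1.0, 0.0]]
    print(ideal_binary_mask(speech, noise))
    print(ideal_ratio_mask(speech, noise))
    print(spectral_magnitude_mask([[2.0, 1.0]], [[4.0, 0.05]]))
```

In this sketch the IBM makes a hard keep-or-discard decision per unit, while the two ratio masks preserve how strongly speech dominates each unit, which is the distinction the comparison in the abstract turns on.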
