ESPnet-SE: End-to-End Speech Enhancement and Separation Toolkit Designed for ASR Integration

Abstract

We present ESPnet-SE, a toolkit designed for the quick development of speech enhancement and speech separation systems in a single framework, along with an optional downstream speech recognition module. ESPnet-SE is a new project that integrates rich automatic speech recognition (ASR) related models, resources, and systems to support and validate the proposed front-end implementation (i.e., speech enhancement and separation). It is capable of processing both single-channel and multi-channel data, with various functionalities including dereverberation, denoising, and source separation. We provide all-in-one recipes including data pre-processing, feature extraction, training, and evaluation pipelines for a wide range of benchmark datasets. This paper describes the design of the toolkit, several important functionalities, especially the speech recognition integration, which differentiates ESPnet-SE from other open source toolkits, and experimental results with major benchmark datasets.
