An End-to-End Architecture of Online Multi-Channel Speech Separation

Abstract

Although mask-based adaptive beamforming benefits speech recognition in far-field, noisy, and multi-talker scenarios, it depends on long temporal context to estimate target and interference statistics; when applied under low-latency requirements, its performance usually drops drastically. In contrast, fixed beamformers introduce no time delay but usually have limited capability to cancel interfering sources. In this work, we propose a novel multi-channel speech separation system that targets overlapped speech recognition with low-latency processing. It comprises four jointly optimized components: a pre-separator, a set of fixed beamformers, an attentional selection module, and a neural post-filter. The proposed model achieves low-latency processing by exploiting the known microphone geometry, while maintaining high-quality separation through neural post-filtering and end-to-end optimization. In our experiments, the proposed system achieves performance comparable to that of the mask-based MVDR and speech extraction systems in offline evaluation, while yielding remarkable improvements in online evaluation.
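
To illustrate why a fixed beamformer needs no temporal context, the following is a minimal sketch, not the system proposed here. It assumes a far-field source, a uniform linear array, and a delay-and-sum design, which is only one possible choice of fixed beamformer. The weights are computed once from the microphone geometry, and a softmax over per-beam scores stands in for the attentional selection module. All function names, the array layout, and the scores are hypothetical; in the proposed system the selection is learned jointly with the other components.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def steering_vector(mic_positions, angle_deg, freq_hz):
    """Far-field steering vector for microphones on a line (positions in metres)."""
    theta = math.radians(angle_deg)
    return [
        cmath.exp(-2j * math.pi * freq_hz * x * math.cos(theta) / SPEED_OF_SOUND)
        for x in mic_positions
    ]


def delay_and_sum_weights(mic_positions, angle_deg, freq_hz):
    """Fixed weights derived from geometry only; no signal statistics are needed."""
    d = steering_vector(mic_positions, angle_deg, freq_hz)
    return [value / len(d) for value in d]


def apply_beamformer(weights, mic_bin):
    """Beamformer output y = w^H x for a single time-frequency bin."""
    return sum(w.conjugate() * x for w, x in zip(weights, mic_bin))


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_selection(beam_outputs, scores):
    """Soft selection: attention-weighted sum of the fixed beam outputs."""
    return sum(a * y for a, y in zip(softmax(scores), beam_outputs))


# Hypothetical 4-microphone array with 5 cm spacing, one 2 kHz frequency bin.
mics = [0.0, 0.05, 0.10, 0.15]
freq = 2000.0
weights = delay_and_sum_weights(mics, 60.0, freq)

# Unit-amplitude plane waves from the look direction and from an interferer.
target_out = apply_beamformer(weights, steering_vector(mics, 60.0, freq))
interferer_out = apply_beamformer(weights, steering_vector(mics, 120.0, freq))
```

In this sketch the beam passes the look direction with unit gain and attenuates the off-axis plane wave without estimating anything from the signal. The attenuation is only partial, which matches the limited interference cancellation noted above.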