Publication | Closed Access
Interference Reduction in Reverberant Speech Separation With Visual Voice Activity Detection
Citations: 15
References: 32
Year: 2014
Source Separation, Engineering, Speech Intelligibility, Speech Enhancement, Speech Recognition, Phonetics, Audio Mixtures, Noise, Robust Speech Recognition, Speech Mixtures, Health Sciences, Distant Speech Recognition, Signal Processing, Speech Communication, Voice Activity Detection, Voice, Visual Modality, Multi-speaker Speech Recognition, Speech Acoustics, Speech Processing, Speech Separation, Interference Reduction, Speech Perception, Signal Separation
The visual modality, deemed complementary to the audio modality, has recently been exploited to improve the performance of blind source separation (BSS) of speech mixtures, especially in adverse environments where the performance of audio-domain methods deteriorates steadily. In this paper, we present a method for enhancing audio-domain BSS by integrating voice activity information obtained via a visual voice activity detection (VAD) algorithm. Mimicking aspects of human hearing, our two-stage system operates on binaural speech mixtures. First, in the off-line training stage, a speaker-independent voice activity detector is trained on the visual stimuli using the AdaBoost algorithm. Then, in the on-line separation stage, interaural phase difference (IPD) and interaural level difference (ILD) cues are statistically analyzed to probabilistically assign each time-frequency (TF) point of the audio mixtures to the source signals. Next, the voice activity cues detected by the visual VAD are integrated to reduce the interference residual. Detection of the interference residual proceeds gradually, using two layers of boundaries in the correlation and energy ratio map. We tested our algorithm on speech mixtures generated using room impulse responses at different reverberation times and noise levels. Simulation results show that the proposed method improves target speech extraction in noisy and reverberant environments, in terms of signal-to-interference ratio (SIR) and perceptual evaluation of speech quality (PESQ).
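To make the interaural cues in the abstract concrete, here is a minimal Python sketch of how IPD and ILD can be computed per time-frequency point from a binaural pair, followed by a deliberately simple frame-level gate driven by voice activity decisions. This is not the authors' implementation. The function names (`stft`, `interaural_cues`, `apply_vad_gate`), the STFT parameters, and the hard gating rule are all assumptions made for illustration. The paper itself describes a probabilistic TF assignment and a two-layer detection of the interference residual on a correlation and energy ratio map, which this sketch does not reproduce.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Naive Hann-windowed STFT; returns shape (frames, n_fft // 2 + 1).

    Hypothetical helper: frame length and hop are illustrative choices.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def interaural_cues(left, right, eps=1e-12):
    """Return (IPD in radians, ILD in dB) for every time-frequency point."""
    L = stft(left)
    R = stft(right)
    # Phase of the left/right cross-spectrum, wrapped to (-pi, pi].
    ipd = np.angle(L * np.conj(R))
    # Level ratio in dB; eps guards against division by zero.
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    return ipd, ild


def apply_vad_gate(mask, vad_frames, floor=0.0):
    """Set a target TF mask to `floor` in frames flagged as silent.

    Assumption: a hard per-frame gate stands in for the paper's
    gradual, two-layer interference residual detection.
    """
    vad_frames = np.asarray(vad_frames, dtype=bool)
    gated = mask.copy()
    gated[~vad_frames, :] = floor
    return gated
```

With a right channel that is simply the left channel attenuated by a factor of two, `interaural_cues` yields an IPD of zero and an ILD of about 6.02 dB at every TF point, which is a quick sanity check before moving to reverberant mixtures.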