Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation

TLDR

The study addresses inference in a data-driven, object-based model of multichannel audio. Convolutive mixtures are approximated as linear instantaneous mixing in each frequency band of the STFT domain, and each source STFT is modeled by nonnegative matrix factorization with the Itakura-Saito divergence, which corresponds to a statistical model of superimposed Gaussian components. The mixing and source parameters are estimated in two ways: an expectation-maximization algorithm that maximizes the exact joint likelihood of the multichannel data, and a multiplicative-update scheme that maximizes the sum of the individual channel likelihoods. Both are applied to stereo source separation in blind and supervised settings, on music and speech sources, with synthetic instantaneous and convolutive mixtures as well as professionally produced music recordings. The EM approach gives results competitive with the state of the art on two SiSEC 2008 tasks.

Abstract

We consider inference in a general data-driven object-based model of multichannel audio data, assumed generated as a possibly underdetermined convolutive mixture of source signals. We work in the short-time Fourier transform (STFT) domain, where convolution is routinely approximated as linear instantaneous mixing in each frequency band. Each source STFT is given a model inspired by nonnegative matrix factorization (NMF) with the Itakura-Saito divergence, which underlies a statistical model of superimposed Gaussian components. We address estimation of the mixing and source parameters using two methods. The first one consists of maximizing the exact joint likelihood of the multichannel data using an expectation-maximization (EM) algorithm. The second method consists of maximizing the sum of individual likelihoods of all channels using a multiplicative update algorithm inspired by NMF methodology. Our decomposition algorithms are applied to stereo audio source separation in various settings, covering blind and supervised separation, music and speech sources, synthetic instantaneous and convolutive mixtures, as well as professionally produced music recordings. Our EM method produces competitive results with respect to the state of the art, as illustrated on two tasks from the international Signal Separation Evaluation Campaign (SiSEC 2008).
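
To make the NMF building block concrete, the following is a minimal sketch of single-channel NMF under the Itakura-Saito divergence, fitted with the standard heuristic multiplicative updates. It is not the multichannel EM or multiplicative-update algorithm described in the abstract: there is no mixing matrix and only one channel. The function names (`is_divergence`, `is_nmf`), the toy data, and the parameters `n_components`, `n_iter`, and `eps` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def is_divergence(V, V_hat):
    """Itakura-Saito divergence, summed over all entries: x/y - log(x/y) - 1."""
    R = V / V_hat
    return float(np.sum(R - np.log(R) - 1.0))


def is_nmf(V, n_components, n_iter=200, eps=1e-12, seed=0):
    """Approximate a positive matrix V (F x N) as W @ H using the
    standard heuristic multiplicative updates for the IS divergence.

    Assumption: V plays the role of a single-channel power spectrogram.
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    # Random positive initialization of the spectral basis W and activations H.
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, N)) + eps

    costs = [is_divergence(V, W @ H + eps)]
    for _ in range(n_iter):
        # Update activations H with W fixed.
        V_hat = W @ H + eps
        H *= (W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat))
        # Update basis W with the new H fixed.
        V_hat = W @ H + eps
        W *= ((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T)
        costs.append(is_divergence(V, W @ H + eps))
    return W, H, costs


# Toy example: a synthetic rank-3 positive matrix standing in for a spectrogram.
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 30)) + 1e-6
W, H, costs = is_nmf(V, n_components=3)
print(f"IS divergence: {costs[0]:.3f} -> {costs[-1]:.3f}")
```

Because the updates are multiplicative and every factor in them is positive, `W` and `H` stay nonnegative without any explicit projection. In the multichannel setting of the abstract, a factorization of this kind would model each source's power spectrogram, with the mixing parameters estimated jointly.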
