End-to-End Domain-Adversarial Voice Activity Detection

Abstract

Voice activity detection is the task of detecting speech regions in a given audio stream or recording. First, we design a neural network combining trainable filters and recurrent layers to tackle voice activity detection directly from the waveform. Experiments on the challenging DIHARD dataset show that the proposed end-to-end model reaches state-of-the-art performance and outperforms a variant where trainable filters are replaced by standard cepstral coefficients. Our second contribution aims at making the proposed voice activity detection model robust to domain mismatch. To that end, a domain classification branch is added to the network and trained in an adversarial manner. The same DIHARD dataset, drawn from 11 different domains, is used for evaluation under two scenarios. In the in-domain scenario, where the training and test sets cover the exact same domains, we show that the domain-adversarial approach does not degrade the performance of the proposed end-to-end model. In the out-domain scenario, where the test domain is different from the training domains, it brings a relative improvement of more than 10%. Our last contribution is the provision of a fully reproducible open-source pipeline that can be easily adapted to other datasets.
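
The abstract does not say how the domain classification branch is trained adversarially. One common mechanism for this kind of setup is a gradient reversal layer, and the sketch below illustrates that idea in plain Python under that assumption. The names `GradientReversal`, `lam`, and `encoder_gradient` are hypothetical and are not taken from the paper or its released pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradientReversal:
    """Identity in the forward pass; negates and scales gradients in the backward pass."""

    lam: float = 1.0  # hypothetical trade-off weight, not a value from the paper

    def forward(self, features: List[float]) -> List[float]:
        # Shared features reach the domain classification branch unchanged.
        return list(features)

    def backward(self, grad_from_domain_branch: List[float]) -> List[float]:
        # The shared encoder receives the negated gradient, so an update that
        # would help the domain classifier instead pushes the features toward
        # being less domain-specific.
        return [-self.lam * g for g in grad_from_domain_branch]


def encoder_gradient(
    grad_vad: List[float],
    grad_domain: List[float],
    grl: GradientReversal,
) -> List[float]:
    # Total gradient seen by the shared encoder: the voice activity detection
    # gradient plus the reversed domain classification gradient.
    return [gv + gd for gv, gd in zip(grad_vad, grl.backward(grad_domain))]
```

With this arrangement the domain branch itself still learns to classify domains, while the shared layers are updated in the opposite direction, which is what makes the training adversarial in this sketch.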
