Deep neural networks employing Multi-Task Learning and stacked bottleneck features for speech synthesis

TLDR

Deep neural networks (DNNs) can learn the complex mapping from text-based linguistic features to speech acoustic features, enabling text-to-speech synthesis, and recent results suggest they produce more natural synthetic speech than conventional HMM-based statistical parametric systems. This study improves the hidden representation within a DNN through Multi-Task Learning and evaluates stacked bottleneck features, formed by concatenating hidden-layer activations from multiple frames. Experiments confirm the effectiveness of both methods, and listening tests show that stacked bottleneck features in particular offer a significant improvement over both a baseline DNN and a benchmark HMM system.

Abstract

Deep neural networks (DNNs) use a cascade of hidden representations to enable the learning of complex mappings from input to output features. They can learn the complex mapping from text-based linguistic features to speech acoustic features, and so perform text-to-speech synthesis. Recent results suggest that DNNs can produce more natural synthetic speech than conventional HMM-based statistical parametric systems. In this paper, we show that the hidden representation used within a DNN can be improved through the use of Multi-Task Learning, and that stacking multiple frames of hidden layer activations (stacked bottleneck features) also leads to improvements. Experimental results confirm the effectiveness of the proposed methods, and in listening tests we find that stacked bottleneck features in particular offer a significant improvement over both a baseline DNN and a benchmark HMM system.
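
The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the context width, the edge padding, the squared-error loss and the auxiliary-task weight are all assumptions made for illustration, since the abstract specifies none of them.

```python
import numpy as np


def stack_bottleneck_features(bottleneck, context=2):
    """Concatenate each frame's bottleneck activations with its neighbours'.

    bottleneck: array of shape (num_frames, bottleneck_dim)
    context:    frames taken on each side (hypothetical default)
    Returns an array of shape (num_frames, (2 * context + 1) * bottleneck_dim).
    """
    bottleneck = np.asarray(bottleneck, dtype=float)
    num_frames = bottleneck.shape[0]
    # Assumption: repeat the first and last frame so edge frames get a full window.
    padded = np.pad(bottleneck, ((context, context), (0, 0)), mode="edge")
    # Slice k holds the frame at offset (k - context) from each centre frame.
    windows = [padded[k:k + num_frames] for k in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)


def mtl_loss(pred_main, target_main, pred_aux, target_aux, aux_weight=1.0):
    """Multi-task objective: main-task error plus a weighted auxiliary-task error.

    Assumption: both tasks use mean squared error and share the hidden layers
    that produced the predictions; aux_weight is a hypothetical hyperparameter.
    """
    main = np.mean((np.asarray(pred_main) - np.asarray(target_main)) ** 2)
    aux = np.mean((np.asarray(pred_aux) - np.asarray(target_aux)) ** 2)
    return main + aux_weight * aux
```

With `context=2` and a 32-dimensional bottleneck layer, for example, each frame would be represented by a 160-dimensional stacked vector. How the stacked features are consumed downstream is not described in the abstract, so the sketch stops at their construction.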
