Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features

Abstract

Voice conversion for highly expressive speech is challenging. Current approaches struggle to balance speaker similarity, intelligibility, and expressiveness. To address this problem, we propose Expressive-VC, a novel end-to-end voice conversion framework that combines the advantages of the neural bottleneck feature (BNF) approach and the information perturbation approach. Specifically, a BNF encoder and a Perturbed-Wav encoder together form a content extractor that learns linguistic and para-linguistic features, respectively. The BNFs come from a robust pre-trained ASR model, and the perturbed waveform becomes speaker-irrelevant after signal perturbation. We then fuse the linguistic and para-linguistic features through an attention mechanism whose query is a speaker-dependent prosody feature. This feature is produced by a prosody encoder that takes the target speaker embedding and the normalized pitch and energy of the source speech as input. Finally, the decoder consumes the fused features and the speaker-dependent prosody feature to generate the converted speech. Experiments show that Expressive-VC outperforms several popular systems, achieving both high expressiveness captured from the source speech and high speaker similarity with the target speaker, while intelligibility is well maintained.
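
The fusion step described above can be illustrated with a minimal sketch. It is not the Expressive-VC implementation: it assumes a frame-wise scaled dot-product attention in which the prosody feature of each frame is the query and the two content features of that frame serve as both keys and values. The function name `fuse_features`, the shared feature dimension, and the random inputs are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_features(prosody, linguistic, paralinguistic):
    """Fuse two content streams frame by frame (illustrative assumption).

    All inputs have shape (T, D): T frames, D feature dimensions.
    Returns the fused features (T, D) and the attention weights (T, 2).
    """
    d = prosody.shape[-1]
    # Stack the two candidate streams per frame: shape (T, 2, D).
    candidates = np.stack([linguistic, paralinguistic], axis=1)
    # Scaled dot product between the query and each candidate: shape (T, 2).
    scores = np.einsum("td,tkd->tk", prosody, candidates) / np.sqrt(d)
    # Per-frame weights over the two streams; each row sums to 1.
    weights = softmax(scores, axis=-1)
    # Weighted sum of the two streams: shape (T, D).
    fused = np.einsum("tk,tkd->td", weights, candidates)
    return fused, weights


# Hypothetical stand-ins for the encoder outputs.
rng = np.random.default_rng(0)
T, D = 5, 8
prosody = rng.normal(size=(T, D))
ling = rng.normal(size=(T, D))
para = rng.normal(size=(T, D))

fused, weights = fuse_features(prosody, ling, para)
print(fused.shape, weights.shape)
```

Under this assumption, each output frame is a convex combination of the linguistic and para-linguistic features, so the weights show which stream dominates in each frame.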
