
TLDR

Speech synthesis is increasingly treated as a machine-learning task thanks to advances in deep learning. To accelerate speech-synthesis research, the authors created the JVS corpus, an accessible Japanese multi-speaker voice corpus, and describe its design and specifications. Complementing their earlier single-speaker, 10-hour JSUT corpus, they collected speech from 100 speakers in three styles (normal, whisper, and falsetto). The resulting corpus comprises 30 hours of speech, including 22 hours of parallel normal voices, and is publicly available online.

Abstract

Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible to not only academic institutions but also commercial companies. In 2017, we released the JSUT corpus, which contains 10 hours of reading-style speech uttered by a single speaker, for end-to-end text-to-speech synthesis. For more general use in speech synthesis research, e.g., voice conversion and multi-speaker modeling, in this paper we construct the JVS corpus, which contains voice data of 100 speakers in three styles (normal, whisper, and falsetto). The corpus contains 30 hours of voice data, including 22 hours of parallel normal voices. This paper describes how we designed the corpus and summarizes its specifications. The corpus is available at our project page.
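
The specifications above (100 speakers, three recording styles, a parallel normal-voice subset) suggest a straightforward way to index the corpus programmatically. The following is a minimal Python sketch of such an index; the corpus root, speaker-directory pattern, and per-style subset folder names (parallel100, whisper10, falsetto10) are assumptions made for illustration and are not specified in this summary.

```python
import pathlib
from collections import defaultdict

# Hypothetical corpus root; the actual download directory name may differ.
CORPUS_ROOT = pathlib.Path("jvs_corpus")

# The paper specifies 100 speakers recorded in three styles
# (normal, whisper, falsetto). The subset folder names below are
# assumptions for illustration, not taken from the paper.
STYLES = {
    "normal": "parallel100",   # assumed subset of parallel normal voices
    "whisper": "whisper10",    # assumed whisper-style subset
    "falsetto": "falsetto10",  # assumed falsetto-style subset
}

def index_corpus(root: pathlib.Path) -> dict[str, dict[str, list[pathlib.Path]]]:
    """Map speaker ID -> style -> sorted list of wav files."""
    index: dict[str, dict[str, list[pathlib.Path]]] = defaultdict(dict)
    for speaker_dir in sorted(root.glob("jvs*")):  # e.g. jvs001 .. jvs100
        for style, subset in STYLES.items():
            wavs = sorted(speaker_dir.glob(f"{subset}/**/*.wav"))
            if wavs:
                index[speaker_dir.name][style] = wavs
    return index

if __name__ == "__main__":
    index = index_corpus(CORPUS_ROOT)
    print(f"speakers indexed: {len(index)}")
    for speaker, styles in list(index.items())[:3]:
        print(speaker, {style: len(wavs) for style, wavs in styles.items()})
```

Keying the index by speaker and then by style keeps multi-speaker modeling (iterate over speakers) and style-conversion experiments (pair styles within a speaker) equally easy to express.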
