
TLDR

This study presents recent developments in the ESPnet toolkit, centered on the Conformer architecture. The authors plan to release all-in-one recipes, built on open-source and publicly available corpora and accompanied by pre-trained models, to reduce the resource burden of setting up state-of-the-art research environments. Experiments show that Conformer-based models in ESPnet are competitive with or outperform Transformer models across ASR, ST, SS, and TTS, and the paper reports various training tips found along the way.

Abstract

In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involve a recently proposed architecture called Conformer, a Convolution-augmented Transformer. This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS) and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with or even outperform the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes using open source and publicly available corpora for all the above tasks with pre-trained models. Our aim for this work is to contribute to our research community by reducing the burden of preparing state-of-the-art research environments, which usually require substantial resources.
