WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition

TLDR

WSJCAM0, derived from the Wall Street Journal text corpus, is one of the largest corpora of spoken British English. It was recorded at Cambridge University to support the construction and evaluation of speaker-independent speech recognition systems. The paper describes the motivation for the corpus, how it was constructed, and the supporting utilities. The corpus comprises 140 speakers, each contributing about 110 utterances, with verified transcriptions and a phonetic dictionary covering the training data and evaluation tasks. Two evaluation tasks are defined, using standard 5,000-word bigram and 20,000-word trigram language models, and the paper reports comparative results on these tasks for British and American English.

Abstract

A significant new speech corpus of British English has been recorded at Cambridge University. Derived from the Wall Street Journal text corpus, WSJCAM0 constitutes one of the largest corpora of spoken British English currently in existence. It has been specifically designed for the construction and evaluation of speaker-independent speech recognition systems. The database consists of 140 speakers, each speaking about 110 utterances. This paper describes the motivation for the corpus, the processes undertaken in its construction, and the utilities needed as support tools. All utterance transcriptions have been verified, and a phonetic dictionary has been developed to cover the training data and evaluation tasks. Two evaluation tasks have been defined using standard 5,000-word bigram and 20,000-word trigram language models. The paper concludes with comparative results on these tasks for British and American English.
