Publication | Closed Access
WSJCAMO: a British English speech corpus for large vocabulary continuous speech recognition
Citations: 356
References: 3
Year: 2002
Venue: Unknown
Engineering, Speech Corpus, Spoken Language Processing, Corpus Linguistics, Speech Recognition, Natural Language Processing, Language Documentation, Computational Linguistics, Phonetics, Robust Speech Recognition, Voice Recognition, Language Studies, Machine Translation, Utterance Transcriptions, British English, Spoken British English, Speech Communication, Speech Analysis, Language Recognition, Speech Processing, Speech Input, Linguistics
WSJCAMO, derived from the Wall Street Journal text corpus, is one of the largest spoken British English corpora available. The corpus was created to support speaker‑independent speech recognition system development and evaluation, and this paper details its motivation, construction processes, and required utilities. It comprises 140 speakers delivering about 110 utterances each, with verified transcriptions, a phonetic dictionary, and two evaluation tasks using 5,000‑word bigram and 20,000‑word trigram language models. The paper reports comparative results on these tasks for British and American English, demonstrating the corpus’s utility for cross‑dialect evaluation.
A significant new speech corpus of British English has been recorded at Cambridge University. Derived from the Wall Street Journal text corpus, WSJCAMO constitutes one of the largest corpora of spoken British English currently in existence. It has been specifically designed for the construction and evaluation of speaker-independent speech recognition systems. The database consists of 140 speakers, each speaking about 110 utterances. This paper describes the motivation for the corpus, the processes undertaken in its construction, and the utilities needed as support tools. All utterance transcriptions have been verified, and a phonetic dictionary has been developed to cover the training data and evaluation tasks. Two evaluation tasks have been defined using standard 5,000-word bigram and 20,000-word trigram language models. The paper concludes with comparative results on these tasks for British and American English.