Publication | Open Access
The Design for the Wall Street Journal-based CSR Corpus
Citations: 1.1K
References: 8
Year: 1992
Topics: Engineering, Speech Corpus, Text Processing Steps, Spoken Language Processing, Corpus Linguistics, Text Mining, Speech Recognition, Natural Language Processing, Information Retrieval, Data Science, Significant Speech Corpora, Computational Linguistics, Machine Translation, Health Sciences, Linguistics, Information Extraction, Speech Analysis, Text Processing, CSR Corpus, Language Corpus, Speech Processing, Speech Input, Speech Perception, Speech Interface
The DARPA Spoken Language System community has led the development of large speech corpora, and the Wall Street Journal CSR Corpus is the latest addition to this collection. The paper outlines the goals and design of the WSJ CSR Corpus, detailing acoustic data, text processing, lexicons, and testing paradigms. The WSJ corpus offers 400 hours of speech and 47 million words of text, enabling integration of speech recognition and natural language processing in high‑value application domains.
The DARPA Spoken Language System (SLS) community has long taken a leadership position in designing, implementing, and globally distributing significant speech corpora widely used for advancing speech recognition research. The Wall Street Journal (WSJ) CSR Corpus described here is the newest addition to this valuable set of resources. In contrast to previous corpora, the WSJ corpus will provide DARPA with its first general-purpose English, large-vocabulary, natural-language, high-perplexity corpus containing significant quantities of both speech data (400 hours) and text data (47 million words), thereby providing a means to integrate speech recognition and natural language processing in application domains with high potential practical value. This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus.