Publication | Closed Access
Large vocabulary ASR for spontaneous Czech in the MALACH project
Citations: 29
References: 4
Year: 2003
Keywords: Engineering, Speech Corpus, Spoken Language Processing, Language Learning, Corpus Linguistics, Speech Recognition, Natural Language Processing, Applied Linguistics, Language Documentation, Computational Linguistics, Phonetics, Language Engineering, Spontaneous Czech Speech, Language Studies, Lexicon, Machine Translation, Linguistics, Language Technology, Automatic Transcription, LVCSR Research, Speech Communication, Large Vocabulary ASR, Lexical Resource, Language Recognition, Speech Processing, Speech Translation
This paper describes large-vocabulary continuous speech recognition (LVCSR) research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. The project aims to improve access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the state of the art in automatic speech recognition (ASR). We describe a baseline ASR system and discuss the language modeling problems that arise because Czech is a highly inflectional language that also exhibits diglossia between its written and spontaneous spoken forms. The task is made harder by heavily accented, emotional, and disfluent speech, along with frequent switching between languages. To overcome the limited amount of relevant language model data, we use statistical techniques to select an appropriate training corpus from a large unstructured text collection, which yields significant reductions in word error rate.
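The abstract does not say which statistical selection technique the authors use. As an illustration of the general idea only, the following minimal sketch ranks sentences from a large text pool by their per-word cross-entropy under a unigram model trained on a small in-domain sample, and keeps the closest matches. The function names, the add-one smoothing, the whitespace tokenization, and the English example sentences are all assumptions made for this sketch; they are not taken from the paper.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count words in the in-domain sentences (lower-cased, whitespace-tokenized)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return counts, sum(counts.values())


def cross_entropy(sentence, counts, total, vocab_size):
    """Per-word cross-entropy in bits under an add-one-smoothed unigram model."""
    words = sentence.lower().split()
    if not words:
        return float("inf")  # empty sentences are never selected first
    log_prob = sum(
        math.log2((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return -log_prob / len(words)


def select_similar(in_domain, pool, keep):
    """Return the `keep` pool sentences with the lowest cross-entropy."""
    counts, total = train_unigram(in_domain)
    # Vocabulary covers both in-domain and pool words so smoothing is well defined.
    vocab = set(counts) | {w for s in pool for w in s.lower().split()}
    ranked = sorted(
        pool, key=lambda s: cross_entropy(s, counts, total, len(vocab))
    )
    return ranked[:keep]


# Hypothetical data: a tiny in-domain sample and an unstructured text pool.
in_domain = ["we were taken to the camp", "my family lived in the village"]
pool = [
    "the stock market closed higher today",
    "we lived in the camp",
    "parliament approved the budget",
]
selected = select_similar(in_domain, pool, keep=1)
print(selected)
```

In a realistic setting the in-domain sample would be transcripts of the spontaneous speech, the pool would be the large unstructured text collection, and a stronger model than a unigram (for example, an n-gram model) would likely be used for scoring.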