Publication | Closed Access
Using out-of-domain data to improve in-domain language models
60 Citations | 7 References | Year: 1997
Keywords: Llm Fine-tuning, Engineering, Machine Learning, Domain-dependent Text, Word Error Rate, Multilingual Pretraining, Semantics, Large Language Model, Corpus Linguistics, Out-of-domain Data, Text Mining, Speech Recognition, Natural Language Processing, Data Science, Computational Linguistics, Standard Statistical Language, Language Studies, Machine Translation, Large Ai Model, Speech Processing, Domain Model, Linguistics
Standard statistical language modeling techniques suffer from sparse-data problems when applied to real tasks in speech recognition, where large amounts of domain-dependent text are not available. We investigate new approaches to improving sparse application-specific language models by combining domain-dependent and out-of-domain data, including a back-off scheme that effectively leads to context-dependent multiple interpolation weights, and a likelihood-based similarity weighting scheme that uses data discriminatively to train a task-specific language model. Experiments with both approaches on a spontaneous speech recognition task (Switchboard) reduce the word error rate relative to a domain-specific n-gram language model, giving a larger gain than that obtained with previous brute-force data combination approaches.
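Since the paper itself is closed access, the following is only a minimal Python sketch of the two ideas the abstract names: context-dependent interpolation of an in-domain and an out-of-domain n-gram model, and likelihood-based similarity weighting of out-of-domain data. The function names, the add-one smoothing, and the weighting function `lambda(h) = c(h) / (c(h) + k)` are illustrative assumptions, not the paper's actual formulation.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count-based bigram model with add-one smoothing (the paper uses a
    proper back-off scheme; this simple smoother is illustrative only)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab_size = len(unigrams)

    def prob(word, hist):
        return (bigrams[(hist, word)] + 1) / (unigrams[hist] + vocab_size)

    return prob, unigrams

def interpolated_prob(word, hist, p_in, p_out, in_counts, k=5.0):
    """Context-dependent interpolation: lean on the in-domain model where
    its history counts are reliable, and on the out-of-domain model
    elsewhere. lambda(h) = c(h) / (c(h) + k) is a hypothetical weighting
    function standing in for the paper's context-dependent weights."""
    lam = in_counts[hist] / (in_counts[hist] + k)
    return lam * p_in(word, hist) + (1 - lam) * p_out(word, hist)

def similarity_score(sentence, p_in, p_out):
    """Likelihood-based similarity: per-word log-likelihood ratio of a
    candidate out-of-domain sentence under the in-domain vs. the
    out-of-domain model; higher scores mark more task-relevant data."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    llr = sum(math.log(p_in(w, h)) - math.log(p_out(w, h))
              for h, w in zip(toks, toks[1:]))
    return llr / (len(toks) - 1)

# Hypothetical usage: tiny in-domain (conversational) and
# out-of-domain (newswire-style) corpora.
p_in, in_counts = train_bigram(["yeah i see what you mean",
                                "uh huh that sounds right"])
p_out, _ = train_bigram(["the committee approved the budget",
                         "yeah the plan sounds right to me"])
print(interpolated_prob("sounds", "that", p_in, p_out, in_counts))
print(similarity_score("yeah that sounds right", p_in, p_out))
```

In this sketch, out-of-domain sentences with high similarity scores would be weighted up (or retained) when training the task-specific model, while the interpolation weight shifts toward the in-domain model only for well-observed histories, which is the intuition behind context-dependent weights.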