Publication | Open Access
XenC: An Open-Source Tool for Data Selection in Natural Language Processing
Citations: 70
References: 7
Year: 2013
Topics: Engineering, Corpus Linguistics, Language Processing, Text Mining, Speech Recognition, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Computational Linguistics, Language Studies, Named-entity Recognition, Machine Translation, NLP Task, Linguistics, Knowledge Discovery, Computer Science, Information Extraction, Open-source Tool, Data Selection, Text Processing, ASR System, Speech Translation
Abstract: In this paper we describe XenC, an open-source tool for data selection aimed at Natural Language Processing (NLP) in general and Statistical Machine Translation (SMT) or Automatic Speech Recognition (ASR) in particular. An SMT or ASR system is usually built for a task tied to a specific application domain, such as news articles or scientific talks. The goal of XenC is to select the data relevant to that task, which is then used to build the system's statistical models. Selection is done by computing the difference between cross-entropy scores of sentences from a large out-of-domain corpus and sentences from a corpus considered as in-domain for the task. Written in C++, the tool operates on monolingual or bilingual data and is language-independent. XenC, now part of the LIUM toolchain for SMT, has been actively developed since December 2011 and is used in many MT projects.
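The cross-entropy difference described in the abstract can be illustrated with a small monolingual sketch. This is not XenC's code or interface: it assumes add-one smoothed unigram models and whitespace tokenization purely for brevity, and the names `train_unigram`, `cross_entropy`, and `score_sentences` are hypothetical. Each out-of-domain sentence is scored as its cross-entropy under an in-domain model minus its cross-entropy under an out-of-domain model, so lower scores suggest a sentence that looks more in-domain.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace-separated tokens; returns (counts, total token count)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(tokens, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one smoothed unigram model."""
    counts, total = model
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def score_sentences(in_domain, out_of_domain):
    """Rank out-of-domain sentences by H_in(s) - H_out(s), lowest first.

    Assumption for illustration: both models share one vocabulary, with one
    extra slot reserved for unseen tokens.
    """
    vocab = {tok for s in in_domain + out_of_domain for tok in s.split()}
    vocab_size = len(vocab) + 1
    in_model = train_unigram(in_domain)
    out_model = train_unigram(out_of_domain)

    scored = []
    for sentence in out_of_domain:
        tokens = sentence.split()
        if not tokens:  # skip empty lines, which have no defined score
            continue
        diff = cross_entropy(tokens, in_model, vocab_size) - cross_entropy(
            tokens, out_model, vocab_size
        )
        scored.append((diff, sentence))
    return sorted(scored)


if __name__ == "__main__":
    in_domain = ["the court ruled on the appeal", "the judge heard the case"]
    out_of_domain = ["the judge ruled on the case", "buy cheap shoes online now"]
    for diff, sentence in score_sentences(in_domain, out_of_domain):
        print(f"{diff:+.3f}  {sentence}")
```

In practice one would keep only the sentences whose score falls below a threshold tuned on held-out in-domain data, and would use stronger language models than unigrams. The bilingual case mentioned in the abstract is not covered by this sketch.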