Publication | Open Access
Inducing multilingual text analysis tools via robust projection across aligned corpora
Citations: 494
References: 16
Year: 2001
Unknown Venue
Syntactic Parsing, Engineering, Cross-lingual Representation, Multilingualism, Part-of-speech Tagging, Corpus Linguistics, Text Mining, Speech Recognition, Natural Language Processing, Applied Linguistics, Syntax, Data Science, Computational Linguistics, Language Engineering, Language Studies, Machine Translation, NLP Task, Optimal Alignments, Cross-language Retrieval, Base Noun-phrase Bracketers, Treebanks, Tag Accuracy, Language Corpus, Linguistics, Robust Projection, POS Tagging
Direct annotation projection is highly noisy, even with optimal alignments. The paper introduces noise-robust training procedures that automatically induce stand-alone monolingual POS taggers, base noun-phrase bracketers, named-entity taggers, and morphological analyzers for an arbitrary foreign language. The approach applies existing English text-analysis tools to bilingual corpora, projects their output onto the second language via statistically derived word alignments, and trains the induced tools on these noisy, incomplete projections. On French, the induced tagger achieves 96% core POS tag accuracy, the induced noun-phrase bracketer exceeds 91% F-measure, and the induced morphological analyzer achieves over 99% lemmatization accuracy on the verbal system, all without hand-annotated training data in the target language and with performance well above that of direct projection.
This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish.

Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite noisy, however, even with optimal alignments. Thus this paper presents noise-robust tagger, bracketer and lemmatizer training procedures capable of accurate system bootstrapping from noisy and incomplete initial projections.

Performance of the induced stand-alone part-of-speech tagger applied to French achieves 96% core part-of-speech (POS) tag accuracy, and the corresponding induced noun-phrase bracketer exceeds 91% F-measure. The induced morphological analyzer achieves over 99% lemmatization accuracy on the complete French verbal system.

This achievement is particularly noteworthy in that it required absolutely no hand-annotated training data in the given language, and virtually no language-specific knowledge or resources beyond raw text. Performance also significantly exceeds that obtained by direct annotation projection.
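
As an illustration of the projection step described in the abstract, the following is a minimal sketch of direct POS-tag projection through word alignments. It is not the paper's implementation: the function names (`project_tags`, `type_level_majority`), the representation of an alignment as `(source_index, target_index)` pairs, and the type-level majority vote are all assumptions made for this example. The majority vote is only a simple stand-in for the idea of smoothing noisy projections; the noise-robust training procedures presented in the paper are not reproduced here.

```python
from collections import Counter, defaultdict


def project_tags(src_tags, alignment, tgt_len):
    """Project source-side tags onto target positions.

    alignment is a list of (src_idx, tgt_idx) pairs (an assumed format).
    Each target token receives the tag most often voted for by the
    source tokens aligned to it.
    """
    votes = defaultdict(Counter)
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][src_tags[src_idx]] += 1
    # Unaligned target tokens stay None: the projection is incomplete.
    return [
        votes[j].most_common(1)[0][0] if j in votes else None
        for j in range(tgt_len)
    ]


def type_level_majority(tagged_sentences):
    """Keep, for each word type, its most frequent projected tag.

    tagged_sentences is an iterable of (tokens, projected_tags) pairs.
    This is a hypothetical smoothing step, not the paper's procedure.
    """
    counts = defaultdict(Counter)
    for tokens, tags in tagged_sentences:
        for token, tag in zip(tokens, tags):
            if tag is not None:  # skip tokens that received no projection
                counts[token.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# English "the red car" aligned to French "la voiture rouge".
english_tags = ["DT", "JJ", "NN"]
alignment = [(0, 0), (1, 2), (2, 1)]
french_tags = project_tags(english_tags, alignment, tgt_len=3)
```

In this toy example the adjective and noun swap positions across the two languages, and the alignment carries each tag to the correct French token. With real, automatically derived alignments, many links are wrong or missing, which is why the projected tags are treated as noisy training data and not as final output.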