Publication | Open Access
Unsupervised learning of Arabic stemming using a parallel corpus
Citations: 53
References: 9
Year: 2003
Unknown Venue
Keywords: Engineering, English Stemmer, Multilingual Pretraining, Corpus Linguistics, Text Mining, Natural Language Processing, Language Documentation, Information Retrieval, Data Science, Arabic, Computational Linguistics, Language Engineering, Language Studies, Named-entity Recognition, Parallel Corpus, Machine Translation, Proprietary Arabic Stemmer, NLP Task, Terminology Extraction, Language Corpus, Text Processing, Proprietary Stemmer, Linguistics
This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation, and its sole training resources are an English stemmer and a small parallel corpus (10K sentences). No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. Examples and results are given for Arabic, but the approach is applicable to any language that needs affix removal. Our resource-frugal approach yields 87.5% agreement with a state-of-the-art proprietary Arabic stemmer that was built using rules, affix lists, and human-annotated text, in addition to an unsupervised component. Task-based evaluation on Arabic information retrieval shows an improvement of 22-38% in average precision over unstemmed text, and 96% of the performance of the proprietary stemmer.
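The agreement figure quoted in the abstract compares the output of two stemmers on the same text. The abstract does not define the metric, so the following is only a minimal sketch under one plausible assumption: agreement is the fraction of tokens for which both stemmers return the identical stem. The function name `agreement_rate` and the sample stems are hypothetical, made-up values for illustration and are not taken from the paper.

```python
def agreement_rate(stems_a, stems_b):
    """Fraction of tokens for which two stemmers output the same stem.

    Assumption (not from the paper): agreement is measured per token,
    by exact string match between the two proposed stems.
    """
    if len(stems_a) != len(stems_b):
        raise ValueError("both stemmers must be run on the same token sequence")
    if not stems_a:
        raise ValueError("cannot compute agreement on an empty sequence")
    matches = sum(a == b for a, b in zip(stems_a, stems_b))
    return matches / len(stems_a)


# Toy, made-up transliterated stems; not real output of either stemmer.
learned_stems = ["ktAb", "drs", "mktb", "qlm"]
reference_stems = ["ktAb", "drs", "ktb", "qlm"]

rate = agreement_rate(learned_stems, reference_stems)
print(f"agreement: {rate:.1%}")
```

A per-token exact match is the simplest choice; a per-type (unique word) comparison would weight rare and frequent words equally and could give a different figure.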