Publication | Open Access
Language model based Arabic word segmentation
Citations: 140
References: 11
Year: 2003
Venue: Unknown
Engineering, Multilingual Pretraining, Large Language Model, Corpus Linguistics, Text Mining, Speech Recognition, Natural Language Processing, Language Documentation, Arabic, Text Segmentation, Computational Linguistics, Language Engineering, Stemming, Language Studies, Machine Translation, Morphology, Computer Science, Trigram Language Model, Segmentation Accuracy, Language Recognition, Text Processing, Linguistics, POS Tagging
Arabic words are modeled as zero or more prefixes, a stem, and zero or more suffixes. The segmenter is bootstrapped from a small manually segmented corpus: a trigram language model is first trained on about 110,000 words, and the model is then re-estimated after stems acquired automatically from a 155-million-word unsegmented corpus are added. The segmenter attains about 97% exact-match accuracy on a 28,449-token test set, which the authors regard as state-of-the-art, and the approach should carry over to other highly inflected languages for which a small seed corpus can be created.
We approximate Arabic's rich morphology with a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded with a small manually segmented Arabic corpus, which bootstraps an unsupervised algorithm that builds the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve segmentation accuracy, we use an unsupervised algorithm to automatically acquire new stems from a 155-million-word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact-match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance and that the algorithm can be used for many highly inflected languages, provided that one can create a small manually segmented corpus of the language of interest.
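The core idea of the abstract (enumerate every prefix*-stem-suffix* split of a word, then keep the one a trigram morpheme model scores highest) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tiny prefix and suffix inventories, the transliterated toy training data, the `#` and `+` boundary markers, the add-k smoothing, and the simplification of scoring each word in isolation instead of in sentence context. The stem-acquisition step is not shown.

```python
import math
from collections import Counter

# Hypothetical toy affix inventories (transliterated); not the paper's lists.
PREFIXES = {"w", "Al", "b", "l"}
SUFFIXES = {"At", "h", "hA", "p"}


def candidate_segmentations(word, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Enumerate all splits of `word` matching prefix*-stem-suffix* (non-empty stem)."""

    def strip_prefixes(s):
        # Yield (prefix_list, remainder) pairs, including the no-prefix case.
        yield [], s
        for p in prefixes:
            if s.startswith(p) and len(s) > len(p):
                for rest_prefixes, rest in strip_prefixes(s[len(p):]):
                    yield [p + "#"] + rest_prefixes, rest

    def strip_suffixes(s):
        # Yield (stem, suffix_list) pairs, including the no-suffix case.
        yield s, []
        for x in suffixes:
            if s.endswith(x) and len(s) > len(x):
                for stem, inner in strip_suffixes(s[:-len(x)]):
                    yield stem, inner + ["+" + x]

    results = []
    for pre, rest in strip_prefixes(word):
        for stem, suf in strip_suffixes(rest):
            results.append(tuple(pre + [stem] + suf))
    return results


class TrigramLM:
    """Trigram model over morphemes with add-k smoothing (an assumed smoothing choice)."""

    def __init__(self, segmented_words, k=0.1):
        self.k = k
        self.tri = Counter()
        self.bi = Counter()
        self.vocab = set()
        for morphemes in segmented_words:
            seq = ["<s>", "<s>"] + list(morphemes) + ["</s>"]
            self.vocab.update(seq)
            for i in range(2, len(seq)):
                self.tri[tuple(seq[i - 2:i + 1])] += 1
                self.bi[tuple(seq[i - 2:i])] += 1

    def logprob(self, morphemes):
        seq = ["<s>", "<s>"] + list(morphemes) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unseen morphemes
        total = 0.0
        for i in range(2, len(seq)):
            num = self.tri[tuple(seq[i - 2:i + 1])] + self.k
            den = self.bi[tuple(seq[i - 2:i])] + self.k * v
            total += math.log(num / den)
        return total


def segment(word, lm):
    """Return the candidate segmentation with the highest trigram log-probability."""
    return max(candidate_segmentations(word), key=lm.logprob)


# Toy stand-in for the manually segmented seed corpus.
SEED = [
    ("w#", "Al#", "ktAb"),
    ("Al#", "ktAb"),
    ("ktAb", "+At"),
    ("w#", "ktAb"),
    ("Al#", "wld"),
]

lm = TrigramLM(SEED)
print(segment("wAlktAb", lm))
print(segment("ktAbAt", lm))
```

With this toy seed data, the unsegmented reading of `wAlktAb` and the partial split `w# AlktAb` both contain unseen trigrams, so the fully segmented candidate scores highest. A realistic system would need a much larger seed corpus and better smoothing than this sketch uses.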