Language Model Based Arabic Word Segmentation

TLDR

Arabic words are modeled as zero or more prefixes, a stem, and zero or more suffixes. A trigram language model is first estimated from a small manually segmented corpus of about 110,000 words, then re-estimated after an unsupervised algorithm acquires new stems from a 155-million-word unsegmented corpus. The segmenter attains about 97% exact-match accuracy on a test corpus of 28,449 word tokens, which the authors believe is state-of-the-art performance, and the approach should carry over to other highly inflected languages for which a small manually segmented corpus can be created.

Abstract

We approximate Arabic's rich morphology by a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (where * denotes zero or more occurrences of a morpheme). Our method is seeded by a small manually segmented Arabic corpus, which it uses to bootstrap an unsupervised algorithm that builds the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve segmentation accuracy, we use an unsupervised algorithm to automatically acquire new stems from a 155-million-word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact-match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance and that the algorithm can be used for many highly inflected languages, provided that one can create a small manually segmented corpus of the language of interest.
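The core idea of scoring prefix*-stem-suffix* candidates with a trigram model can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the affix inventories, the transliterated toy seed corpus, the `#` and `+` affix markers, the add-k smoothing with constants `K` and `NOMINAL_VOCAB`, and the simplification of scoring each word in isolation. The stem-acquisition step from the unsegmented corpus is not shown.

```python
import math
from collections import Counter

# Hypothetical toy affix inventories (Latin transliteration); not taken from the paper.
PREFIXES = {"w", "Al", "b", "l"}
SUFFIXES = {"At", "h", "hm"}

def candidates(word, max_affixes=2):
    """Enumerate every prefix*-stem-suffix* split of `word` with a non-empty stem."""
    def prefix_splits(s, depth):
        yield [], s
        if depth == 0:
            return
        for p in PREFIXES:
            if s.startswith(p) and len(s) > len(p):
                for more, rest in prefix_splits(s[len(p):], depth - 1):
                    yield [p + "#"] + more, rest

    def suffix_splits(s, depth):
        yield s, []
        if depth == 0:
            return
        for x in SUFFIXES:
            if s.endswith(x) and len(s) > len(x):
                for rest, more in suffix_splits(s[:-len(x)], depth - 1):
                    yield rest, more + ["+" + x]

    for prefixes, rest in prefix_splits(word, max_affixes):
        for stem, suffixes in suffix_splits(rest, max_affixes):
            yield prefixes + [stem] + suffixes

K = 0.01              # assumed add-k smoothing constant
NOMINAL_VOCAB = 1000  # assumed nominal morpheme vocabulary size

class TrigramLM:
    """Add-k smoothed trigram model over morphemes, one word at a time."""
    def __init__(self, segmented_words):
        self.tri, self.ctx = Counter(), Counter()
        for morphemes in segmented_words:
            seq = ["<s>", "<s>"] + morphemes + ["</s>"]
            for i in range(2, len(seq)):
                self.tri[tuple(seq[i - 2:i + 1])] += 1
                self.ctx[tuple(seq[i - 2:i])] += 1

    def logprob(self, morphemes):
        seq = ["<s>", "<s>"] + morphemes + ["</s>"]
        total = 0.0
        for i in range(2, len(seq)):
            num = self.tri[tuple(seq[i - 2:i + 1])] + K
            den = self.ctx[tuple(seq[i - 2:i])] + K * NOMINAL_VOCAB
            total += math.log(num / den)
        return total

def segment(word, lm):
    """Return the candidate segmentation with the highest trigram log-probability."""
    return max(candidates(word), key=lm.logprob)

# Toy stand-in for the manually segmented seed corpus.
SEED = [["w#", "Al#", "ktAb"], ["Al#", "ktAb"], ["ktAb", "+h"],
        ["w#", "qlm"], ["Al#", "qlm"]]
lm = TrigramLM(SEED)
print(segment("wAlktAb", lm))
print(segment("ktAbh", lm))
```

With this toy seed, the unsegmented reading of a word loses because its stem was never observed, so segmentations built from seen morphemes score higher. A real system would need a principled treatment of unknown stems, which is the role the abstract assigns to stem acquisition and re-estimation.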
