Publication | Closed Access

Statistics-based segment pattern lexicon - a new direction for Chinese language modeling

Year: 2002 | Citations: 16 | References: 3

Abstract

This paper presents a new direction for Chinese language modeling based on a different concept of the lexicon. Every Chinese character has its own meaning, Chinese sentences contain no "blanks" to serve as word boundaries, and the wording structure of the language is extremely flexible. As a result, the "words" in Chinese are not well defined, and no commonly accepted lexicon exists. This makes language modeling for Chinese very complicated and the "out of vocabulary (OOV)" problem especially serious. A new concept of the lexicon is therefore proposed. The elements of this lexicon can be words or any other "segment patterns", and they are extracted from the training corpus by statistical approaches with the goal of minimizing the overall perplexity. Language models can then be developed on the basis of this new lexicon. Very encouraging experimental results have been obtained.
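
The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of one way the idea could be realized, not the authors' method. It assumes a greedy scheme: start from single characters, repeatedly join the most frequent adjacent pair of segments, and keep a merge only if it lowers a unigram perplexity measured per character. The function names (`per_char_perplexity`, `merge_pair`, `build_lexicon`), the unigram model, and the per-character normalization are all assumptions made for illustration.

```python
import math
from collections import Counter


def per_char_perplexity(segmented):
    """Unigram perplexity normalized per character, so that lexicons
    with segments of different lengths remain comparable (an assumption;
    the abstract only says "overall perplexity")."""
    counts = Counter(seg for sent in segmented for seg in sent)
    total = sum(counts.values())
    n_chars = sum(len(seg) for sent in segmented for seg in sent)
    if n_chars == 0:
        return 1.0  # empty corpus: nothing to predict
    log_prob = sum(c * math.log(c / total) for c in counts.values())
    return math.exp(-log_prob / n_chars)


def merge_pair(segmented, pair):
    """Join every adjacent occurrence of `pair` into a single segment."""
    merged = []
    for sent in segmented:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) == pair:
                out.append(sent[i] + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


def build_lexicon(corpus, max_merges=50):
    """Greedy, hypothetical extraction loop: accept a merge only while
    it strictly reduces the per-character perplexity."""
    segmented = [list(sentence) for sentence in corpus]  # start from characters
    best = per_char_perplexity(segmented)
    for _ in range(max_merges):
        pairs = Counter(p for sent in segmented for p in zip(sent, sent[1:]))
        if not pairs:
            break
        pair, _count = pairs.most_common(1)[0]
        candidate = merge_pair(segmented, pair)
        ppl = per_char_perplexity(candidate)
        if ppl >= best:
            break  # no further improvement: stop merging
        segmented, best = candidate, ppl
    lexicon = {seg for sent in segmented for seg in sent}
    return lexicon, segmented, best


if __name__ == "__main__":
    lexicon, segmented, ppl = build_lexicon(["我們喜歡語言", "我們喜歡模型", "我們研究語言模型"])
    print(sorted(lexicon), round(ppl, 3))
```

Because the lexicon entries here are whatever character sequences survive the merge test, they need not coincide with dictionary words, which is the sense in which the elements can be "segment patterns". A language model of higher order could be trained on the resulting segmentation in the same way.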
