Concepedia

TLDR

Chinese word segmentation has traditionally been treated as a character‑based tagging task, while recent semi‑Markov models incorporate complete‑word features. The study proposes a latent variable model that fuses word and character sequences, capturing long‑range dependencies and improving recall on long words such as named entities. With this improvement, the system ranks among the best on the second SIGHAN CWS bakeoff.

Abstract

Conventional approaches to Chinese word segmentation treat the problem as a character-based tagging task. Recently, semi-Markov models have been applied to the problem, incorporating features based on complete words. In this paper, we propose an alternative, a latent variable model, which uses hybrid information based on both word sequences and character sequences. We argue that the use of latent variables can help capture long range dependencies and improve the recall on segmenting long words, e.g., named-entities. Experimental results show that this is indeed the case. With this improvement, evaluations on the data of the second SIGHAN CWS bakeoff show that our system is competitive with the best ones in the literature.
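The "character-based tagging" formulation the abstract contrasts with can be illustrated with a minimal sketch (the helper below is an assumption for illustration, not code from the paper): each character receives a tag from the common B/M/E/S scheme (Begin, Middle, End of a multi-character word, or Single-character word), and the tag sequence is decoded back into words.

```python
def tags_to_words(chars, tags):
    """Decode a character-level B/M/E/S tag sequence into a word list."""
    words, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):  # this character ends a word
            words.append("".join(current))
            current = []
    if current:  # tolerate a truncated tag sequence
        words.append("".join(current))
    return words

# "北京大学" tagged as one four-character word, then "是" as a single.
print(tags_to_words(list("北京大学是"), ["B", "M", "M", "E", "S"]))
# → ['北京大学', '是']
```

Under this formulation a segmenter only scores individual character tags, which is what limits its view of complete words; the paper's latent variable model adds word-sequence information on top of this character-level view.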

