Publication | Open Access
A discriminative latent variable Chinese segmenter with hybrid word/character information
Citations: 45
References: 19
Year: 2009
Unknown Venue
Engineering, Machine Learning, Part-of-speech Tagging, Text Mining, Chinese Word Segmentation, Natural Language Processing, Speech Recognition, Information Retrieval, Data Science, Hybrid Word/character Information, Pattern Recognition, Text Recognition, Computational Linguistics, Word Segmentation (Natural Language Processing), Text Segmentation, Character Sequences, Language Studies, Character Recognition, Named-entity Recognition, Machine Translation, Long Range Dependencies, Word Segmentation (Phonological Awareness), NLP Task, Linguistics, POS Tagging
Chinese word segmentation has traditionally been treated as a character-based tagging task, but recent semi-Markov models incorporate complete-word features. The study proposes a latent variable model that fuses word and character sequences to capture long-range dependencies and improve recall for long words such as named entities. Experiments confirm that the latent variable approach improves recall for long words, and the system is competitive with the best reported results on the data of the second SIGHAN CWS bakeoff.
Conventional approaches to Chinese word segmentation treat the problem as a character-based tagging task. Recently, semi-Markov models have been applied to the problem, incorporating features based on complete words. In this paper, we propose an alternative, a latent variable model, which uses hybrid information based on both word sequences and character sequences. We argue that the use of latent variables can help capture long-range dependencies and improve the recall on segmenting long words, e.g., named entities. Experimental results show that this is indeed the case. With this improvement, evaluations on the data of the second SIGHAN CWS bakeoff show that our system is competitive with the best ones in the literature.
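To make the "character-based tagging" framing concrete, here is a minimal sketch of how a segmented sentence can be encoded as one tag per character and decoded back. The four-tag B/M/E/S scheme (begin, middle, end, single) and the function names `words_to_tags` and `tags_to_words` are assumptions chosen for illustration; the abstract does not say which tag set the paper uses, and this sketch does not implement the latent variable model itself.

```python
def words_to_tags(words):
    """Encode a segmented sentence as one tag per character (assumed B/M/E/S scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Decode a character sequence and its tags back into words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # keep a trailing fragment if the tags end mid-word
        words.append(current)
    return words


words = ["我", "喜欢", "北京大学"]
tags = words_to_tags(words)
print(tags)
print(tags_to_words("".join(words), tags))
```

Under this encoding a long word such as a named entity becomes a long run of per-character tags that must all be predicted consistently, which is one way to see why the abstract singles out recall on long words as the target for improvement.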