Publication | Open Access
A maximum entropy approach to identifying sentence boundaries
Citations: 393 | References: 8 | Year: 1997
Unknown Venue
Engineering, Part-of-speech Tagging, Corpus Linguistics, Text Mining, Natural Language Processing, Syntax, Computational Linguistics, Language Engineering, Grammar, Language Studies, Machine Translation, NLP Task, Trainable Model, Raw Text, Information Extraction, Semantic Parsing, Sentence Boundaries, Linguistics, Maximum Entropy Approach, POS Tagging
The authors present a trainable model for detecting sentence boundaries in raw text. Trained on a corpus annotated with sentence boundaries, the model learns to classify punctuation marks as valid or invalid sentence boundaries, and it requires no hand-crafted rules or domain-specific resources. It can therefore be trained on any genre of English and should be trainable on other Roman-alphabet languages. Performance is comparable to or better than that of similar systems, and the authors emphasize how easily the model can be retrained for new domains.
We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains.
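The classification setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names, the helper functions `candidates`, `features`, `train`, and `predict`, and the use of plain stochastic gradient ascent to fit a two-class maximum entropy (logistic) model are all assumptions made here for brevity, since the abstract does not specify the feature set or the estimation procedure.

```python
import math
import re

CANDIDATE = re.compile(r"[.?!]")

def candidates(text):
    """Return the character offsets of every '.', '?', and '!' in text."""
    return [m.start() for m in CANDIDATE.finditer(text)]

def features(text, i):
    """Hypothetical binary context features for the candidate at offset i."""
    prefix = re.search(r"\S*$", text[:i]).group()   # characters glued to the left
    rest = text[i + 1:]
    suffix = re.match(r"\S*", rest).group()         # characters glued to the right
    following = rest[len(suffix):].split()
    next_word = following[0] if following else ""
    return {
        "mark=" + text[i],
        "prefix=" + prefix.lower(),
        "suffix_empty=" + str(suffix == ""),
        "next_cap=" + str(next_word[:1].isupper()),
    }

def predict(weights, feats):
    """Probability that the candidate is a valid sentence boundary."""
    score = sum(weights.get(f, 0.0) for f in feats)
    score = max(-30.0, min(30.0, score))            # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, epochs=200, rate=0.5):
    """Fit feature weights by stochastic gradient ascent (an assumption here)."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            error = label - predict(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + rate * error
    return weights

# Toy annotated corpus: 1 = valid boundary, 0 = not a boundary.
text = "Dr. Smith arrived. Was he late? No!"
labels = [0, 1, 1, 1]
offsets = candidates(text)
examples = [(features(text, i), y) for i, y in zip(offsets, labels)]
weights = train(examples)
```

Retraining for a new genre or language would, under these assumptions, amount to calling `train` again on examples extracted from a differently annotated corpus; nothing in the sketch depends on hand-written rules, a lexicon, or part-of-speech tags.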