Publication | Open Access

LT TTT - A Flexible Tokenisation Tool

Citations: 94
References: 9
Year: 2000

Abstract

We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up, and the preparation of a corpus in the medical domain. We conclude with a discussion of the use of browsers to visualise marked-up text.

1. Introduction

The LTG's Text Tokenisation Toolkit (LT TTT, Grover et al., 1999) was developed within an XML processing paradigm whereby tools are combined in a pipeline, allowing each to add, modify or remove some piece of mark-up. The tools are compatible with the LT XML toolset (Thompson et al., 1997) and use the LT XML API to manipulate attribute values and character data ...
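The pipeline paradigm mentioned in the introduction, in which each tool adds, modifies or removes some piece of mark-up, can be illustrated with a minimal sketch. It is written in Python using the standard library's xml.etree.ElementTree, not LT TTT or the LT XML API, and every name in it (mark_paragraphs, mark_words, drop_empty_paragraphs, run_pipeline, and the text, p and w element names) is a hypothetical choice made for illustration only. The whitespace-based splitting is far cruder than the rule-sets the abstract describes.

```python
import re
import xml.etree.ElementTree as ET


def mark_paragraphs(doc):
    """Stage 1 (adds mark-up): wrap blank-line-separated blocks in <p>."""
    text = doc.text or ""
    doc.text = None
    for block in re.split(r"\n\s*\n", text.strip()):
        p = ET.SubElement(doc, "p")
        p.text = " ".join(block.split())  # normalise internal whitespace
    return doc


def mark_words(doc):
    """Stage 2 (modifies mark-up): replace each <p>'s text with <w> tokens."""
    for p in list(doc.iter("p")):
        tokens = (p.text or "").split()
        p.text = None
        for tok in tokens:
            w = ET.SubElement(p, "w")
            w.text = tok
    return doc


def drop_empty_paragraphs(doc):
    """Stage 3 (removes mark-up): delete <p> elements that hold no <w>."""
    for p in list(doc.findall("p")):
        if len(p) == 0:
            doc.remove(p)
    return doc


def run_pipeline(raw_text, stages):
    """Feed one document through the stages in order, like a shell pipe."""
    doc = ET.Element("text")
    doc.text = raw_text
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    stages = [mark_paragraphs, mark_words, drop_empty_paragraphs]
    result = run_pipeline("Hello world.\n\nSecond para here.", stages)
    print(ET.tostring(result, encoding="unicode"))
```

Because every stage takes and returns the same document object, stages can be reordered, dropped or replaced without touching the others, which is the property the pipeline design is meant to provide. A tailored rule-set would, under this sketch's assumptions, simply be another function added to the stages list.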
