Syntax Annotation for the GENIA Corpus

Abstract

Linguistically annotated corpus based on texts in biomedical domain has been constructed to tune natural language processing (NLP) tools for biotextmining. As the focus of information extraction is shifting from &quot;nominal&quot; information such as named entity to &quot;verbal &quot; information such as function and interaction of substances, application of parsers has become one of the key technologies and thus the corpus annotated for syntactic structure of sentences is in demand. A subset of the GENIA corpus consisting of 500 MEDLINE abstracts has been annotated for syntactic structure in an XMLbased format based on Penn Treebank II (PTB) scheme. Inter-annotator agreement test indicated that the writing style rather than the contents of the research abstracts is the source of the difficulty in tree annotation, and that annotation can be stably done by linguists without much knowledge of biology with appropriate guidelines regarding to linguistic phenomena particular to scientific texts. 1

References

Page 1

	Year	Citations

Page 1