Use of a Full Parser for Information Extraction in Molecular Biology Domain

Abstract

There is an increasing need for automatic information extraction (IE) to support database building and to intelligently find novel knowledge of biological events in online journal collections. Many previous researchers (e.g., [3]) extracted such information using hand-tailored regular-expression patterns over a pre-defined set of verbs representing a certain type of reaction. However, because a fact can be expressed in various forms in natural language text, many surface-expression patterns must be prepared for a single event. We propose an alternative information extraction method based on full parsing with a large-scale, general-purpose grammar. In our system, a parser converts the variety of sentences that describe the same event into a canonical structure (argument structure) consisting of the verb representing the event and its arguments, such as the (semantic) subject and object. Information extraction itself is done by pattern matching on the canonical structure. Since the variation in representation is absorbed by the parser, a relatively small number of patterns is required to extract an event. In the current work, we have designed and implemented an argument extractor using a full parser to investigate the plausibility of full analysis of text with a general-purpose parser and grammar applied to the biomedical domain. We introduce two preprocessors to address the problems of full parsers. One is a term recognizer (e.g., [1]) that glues the words in a noun phrase into one chunk so that the parser can handle them as if they were a single word. The other is a shallow parser that reduces lexical ambiguity. Thus, we partially solve the inefficiency and ambiguity problems of full parsing. We also propose the use of modules that handle partial parsing results to overcome the low-coverage problem.
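The idea of matching patterns on a canonical structure rather than on surface text can be illustrated with a minimal sketch. It is not the system described above: the `ArgumentStructure` and `Pattern` types, the `extract_events` function, and the placeholder names `protein_A` and `protein_B` are all hypothetical, and the argument structures are written by hand to stand in for parser output.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ArgumentStructure:
    """Hypothetical canonical form: a verb with its semantic subject and object."""
    verb: str
    subject: Optional[str]
    obj: Optional[str]


@dataclass(frozen=True)
class Pattern:
    """Hypothetical extraction pattern: maps a verb lemma to an event type."""
    verb: str
    event: str


def extract_events(
    structures: List[ArgumentStructure], patterns: List[Pattern]
) -> List[Tuple[str, str, str]]:
    """Return (event, subject, object) for each structure that matches a pattern."""
    by_verb = {p.verb: p for p in patterns}
    events = []
    for s in structures:
        p = by_verb.get(s.verb)
        # Skip verbs without a pattern and structures with a missing argument.
        if p is None or s.subject is None or s.obj is None:
            continue
        events.append((p.event, s.subject, s.obj))
    return events


# Hand-written stand-ins for parser output (an assumption of this sketch).
# An active sentence ("protein_A activates protein_B") and a passive one
# ("protein_B is activated by protein_A") would share one canonical form.
structures = [
    ArgumentStructure("activate", "protein_A", "protein_B"),  # active voice
    ArgumentStructure("activate", "protein_A", "protein_B"),  # passive voice
    ArgumentStructure("contain", "cell", "protein_A"),        # no pattern
]
patterns = [Pattern("activate", "ACTIVATION")]

events = extract_events(structures, patterns)
```

In this sketch a single pattern covers both the active and the passive sentence, which is the sense in which the parser absorbs surface variation; a regular-expression approach would typically need a separate pattern for each form.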
