Concepedia

Abstract

This paper confirms that the performance of a state-of-the-art automatic term extraction method on a computer science corpus is similar to previously published performance data on a medical corpus. The extracted terms are then used to estimate the similarity of papers in the computer science corpus using the standard Vector Space Model. The precision of retrieval using a term-based representation is compared with that of a word-based representation, and with a link-based similarity metric based on the overlap of the local neighborhoods of the papers in the citation graph. The term-based approach offers comparable performance to the word-based approach, but potentially with a much smaller vocabulary size.

Introduction

Automatic term extraction in special text corpora is an interesting problem, which is becoming increasingly relevant as the literature in specific scientific fields such as medicine, biology and computer science explodes, making it difficult to track the evolving terminology of those fields [Kageura and Umino 1996]. Early approaches to automatic term extraction focused on information-theoretic techniques, using mutual information to detect collocations [Manning and Schuetze 1999]. Collocations are expressions composed of two or more words whose meaning is not easy to guess from the meanings of the component words. There are nuances in the detection of collocations that require linguistic criteria to resolve [Justeson and Katz 1995]. Shallow linguistic criteria are based on acceptable sequences of part-of-speech tags, and part-of-speech tagging can be performed automatically [Brill 1992]. A key problem is that of nesting, where subsets of consecutive words within a multi-word term would satisfy the statistical criteria for "termhood" but would not themselves be called terms.
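As a minimal sketch of the mutual-information approach to collocation detection mentioned above: pointwise mutual information (PMI) compares a word pair's joint probability against the product of its marginal probabilities, so pairs that co-occur more often than chance score highly. The toy corpus and function names below are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def pmi(bigram_count, w1_count, w2_count, n_words, n_bigrams):
    """Pointwise mutual information of a word pair, in bits."""
    p_xy = bigram_count / n_bigrams
    p_x = w1_count / n_words
    p_y = w2_count / n_words
    return math.log2(p_xy / (p_x * p_y))

def collocation_scores(tokens):
    """Score every adjacent word pair in a token stream by PMI."""
    words = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_w, n_b = len(tokens), len(tokens) - 1
    return {bg: pmi(c, words[bg[0]], words[bg[1]], n_w, n_b)
            for bg, c in bigrams.items()}

tokens = ("the term extraction task and the term extraction "
          "method share the same core").split()
scores = collocation_scores(tokens)
# The recurrent pair ("term", "extraction") scores above a pair built
# from a frequent function word such as ("the", "term").
```

Note that plain PMI is known to overvalue pairs of rare words, which is one reason the nesting problem above needs additional criteria beyond a single statistical score.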
In the first part of this paper, we describe experiments with a state-of-the-art method, C-value/NC-value [Frantzi et al. 2000], which combines statistical and linguistic information for automatic term extraction. We applied it to a special text corpus of computer science articles, which is of a different nature from the medical corpus on which the method was originally tested. We confirmed that the performance of the method is equally good on our corpus, and we identified some adjustments that the method required. In the second part of this paper, we use the extracted terms to estimate the similarity between two documents, and we evaluate the quality of this term-based similarity estimation in an information retrieval context. It is broadly believed that it is difficult to improve on the retrieval performance of the bag-of-words representation by using more sophisticated features or shallow linguistic techniques. Although retrieval based on terms did not show significant improvement over a bag-of-words representation, our long-term objective is to cluster special text corpora into subareas and automatically generate lexical ontologies from the clusters [Ayad and Kamel 2002]. Terms in this context are of interest in themselves, and not purely as a vehicle for information retrieval. We are, furthermore, interested in similarity criteria that take into account the proximity of terms [Koubarakis 2001], for which again it is essential to work with terms, not words. The use of terms instead of words may also be preferable in information dissemination, where given a database of profiles (of c
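A rough sketch of the statistical core of the C-value measure may help: each multi-word candidate gets a log-length-weighted frequency, discounted by the average frequency of the longer candidates that contain it, which addresses the nesting problem. This assumes candidates have already been filtered by a part-of-speech pattern; the candidate list and frequencies below are invented for illustration, and the NC-value's context-word weighting is omitted.

```python
import math
from collections import defaultdict

def c_values(freq):
    """Statistical C-value for multi-word term candidates.

    freq maps a candidate (tuple of words) to its corpus frequency.
    Nested candidates are discounted by the mean frequency of the
    longer candidates that contain them.
    """
    # For each candidate, collect the longer candidates containing it.
    containers = defaultdict(list)
    for a in freq:
        for b in freq:
            if len(b) > len(a) and any(b[i:i + len(a)] == a
                                       for i in range(len(b) - len(a) + 1)):
                containers[a].append(b)
    scores = {}
    for a, f_a in freq.items():
        longer = containers[a]
        if not longer:
            scores[a] = math.log2(len(a)) * f_a
        else:
            nested = sum(freq[b] for b in longer) / len(longer)
            scores[a] = math.log2(len(a)) * (f_a - nested)
    return scores

freq = {
    ("floating", "point"): 8,
    ("floating", "point", "arithmetic"): 5,
    ("point", "arithmetic"): 5,
}
scores = c_values(freq)
# ("point", "arithmetic") occurs only inside the 3-word candidate, so
# its discounted score drops to zero, while ("floating", "point") keeps
# credit for its 3 independent occurrences.
```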

References

