A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation

Abstract

There is a general concern within the field of word sense dusamb~guatmn about the rater-annotator agreement between human annota tors . In thus paper, we examine th~s msue by comparing the agreement rate on a large corpus of more than 30,000 sense-tagged instances Thin corpus us the mtersectmn of the WORDNET Semcor corpus and the DSO corpus, which has been independently tagged by two separate groups of human annotators The contribution of this paper us two-fold First , ~t presents a greedy search algori thm tha t can automatical ly derive coarser sense classes based on the sense tags assigned by two human annotators The resulting derived coarse sense classes achmve a h~gher agreement rate but we s t f l !mamtam as many of the original sense classes as posmble Second, the coarse sense grouping derived by the algorithm, upon verification by human, can potent ial ly serve as a better sense inventory for evaluating automated word sense d~samb~guatmn algori thms Moreover, we examined the derived coarse sense classes and found some interesting groupings of word senses that correspond to human mtmtlve judgment of sense granularity 1 I n t r o d u c t i o n . It us widely acknowledged that word sense d~samblguatmn (WSD) us a central problem m natural language processing In order for computers to be able to understand and process natural language beyond simple keyword matching, the problem of d~samblguatmg word sense, or dlscermng the meamng of a word m context, must be effectively dealt with Advances in WSD v, ill have slgmficant Impact on apphcatlons hke information retrieval and machine translation For natural language subtasks hke part-of-speech tagging or s)ntactm parsing, there are relatlvely well defined and agreed-upon cnterm of what it means to have the correct part of speech or syntactic structure assigned to a word or sentence For instance, the Penn Treebank corpus (Marcus et a l , 1993) pro~ide~ ,t large repo.~tory of texts annotated w~th partof-speech and s}ntactm structure mformatlon Tv.o independent human annotators can achieve a high rate of agreement on assigning part-of-speech tags to words m a g~ven sentence Unfortunately, th~s us not the case for word sense assignment F~rstly, it is rarely the case that any two dictionaries will have the same set of sense defimtmns for a g~ven word Different d~ctlonanes tend to carve up the semantic space m a different way, so to speak Secondly, the hst of senses for a word m a typical dmtmnar~ tend to be rather refined and comprehensive This is especmlly so for the commonly used words which have a large number of senses The sense dustmctmn between the different senses for a commonly used word m a d~ctmnary hke WoRDNET (Miller, 1990) tend to be rather fine Hence, two human annotators may genuinely dusagree m their sense assignment to a word m context The agreement rate between human annotators on word sense assignment us an Important concern for the evaluatmn of WSD algorithms One would prefer to define a dusamblguatlon task for which there us reasonably hlgh agreement between human annotators The agreement rate between human annotators will then form the upper ceiling against whmh to compare the performance of WSD algorithms For instance, the SENSEVAL exerclse has performed a detaded s tudy to find out the raterannotator agreement among ~ts lexicographers taggrog the word senses (Kllgamff, 1998c, Kllgarnff, 1998a, Kflgarrlff, 1998b) 2 A C a s e S t u d y In th i s -paper , we examine the ~ssue of raterannotator agreement by comparing the agreement rate of human annotators on a large sense-tagged corpus of more than 30,000 instances of the most frequently occurring nouns and verbs of Enghsh This corpus is the intersection of the WORDNET Semcor corpus (Miller et a l , 1993) and the DSO corpus (Ng and Lee, 1996, Ng, 1997), which has been independently tagged wlth the refined senses of WORDNET by two separate groups of human annotators The Semcor corpus us a subset of the Brown corpus tagged with ~VoRDNET senses, and consists of more than 670,000 words from 352 text files Sense taggmg was done on the content words (nouns, ~erbs, adjectives and adverbs) m this subset The DSO corpus consists of sentences drawn from the Brown corpus and the Wall Street Journal For each word w from a hst of 191 frequently occurring words of Enghsh (121 nouns and 70 verbs), sentences containing w (m singular or plural form, and m its various reflectional verb form) are selected and each word occurrence w ~s tagged w~th a sense from WoRDNET There ~s a total of about 192,800 sentences in the DSO corpus m which one word occurrence has been sense-tagged m each sentence The intersection of the Semcor corpus and the DSO corpus thus consists of Brown corpus sentences m which a word occurrence w is sense-tagged m each sentence, where w Is one of.the 191 frequently oc,currmg English nouns or verbs Since this common pomon has been sense-tagged by two independent groups of human annotators, ~t serves as our data set for investigating inter-annotator agreement in this paper