Unsupervised Models for Named Entity Classification

TLDR

Named‑entity classification needs many rules for adequate coverage, which suggests that a large labeled dataset is required. This study investigates using unlabeled data instead and proposes two algorithms. Both exploit natural redundancy in the data: for many named‑entity instances, the spelling of the name and the context in which it appears are each sufficient to determine its type. The first is a Yarowsky‑style algorithm with modifications motivated by Blum and Mitchell; the second extends boosting techniques to the semi‑supervised framework of Blum and Mitchell. With unlabeled data, the supervision requirement drops to just seven simple seed rules.

Abstract

This paper discusses the use of unlabeled examples for the problem of named entity classification. A large number of rules is needed for coverage of the domain, suggesting that a fairly large number of labeled examples should be required to train a classifier. However, we show that the use of unlabeled data can reduce the requirements for supervision to just 7 simple seed rules. The approach gains leverage from natural redundancy in the data: for many named-entity instances both the spelling of the name and the context in which it appears are sufficient to determine its type. We present two algorithms. The first method uses an algorithm similar to that of (Yarowsky 95), with modifications motivated by (Blum and Mitchell 98). The second algorithm extends ideas from boosting algorithms, designed for supervised learning tasks, to the framework suggested by (Blum and Mitchell 98).
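
To make the redundancy idea concrete, here is a minimal Python sketch of the kind of bootstrapping loop the abstract describes: seed spelling rules label some examples, those labels are used to induce context rules, and the context rules in turn label more examples from which new spelling rules are induced. It is an illustration under stated assumptions, not the paper's exact procedure. The function names (`induce_rules`, `apply_rules`, `bootstrap`), the feature strings, the smoothing constant, the threshold, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict

def induce_rules(examples, labels, view, threshold, alpha=0.1, n_labels=3):
    """Induce feature -> (label, confidence) rules in one view.

    Only examples that currently carry a label contribute counts.
    The smoothed confidence, alpha and n_labels are assumptions of this sketch.
    """
    counts = defaultdict(Counter)
    for ex, y in zip(examples, labels):
        if y is None:
            continue
        for feat in ex[view]:
            counts[feat][y] += 1
    rules = {}
    for feat, c in counts.items():
        y, n = c.most_common(1)[0]
        conf = (n + alpha) / (sum(c.values()) + n_labels * alpha)
        if conf >= threshold:
            rules[feat] = (y, conf)
    return rules

def apply_rules(examples, rules, view):
    """Label each example with its most confident matching rule, else None."""
    labels = []
    for ex in examples:
        best = None
        for feat in ex[view]:
            if feat in rules and (best is None or rules[feat][1] > best[1]):
                best = rules[feat]
        labels.append(best[0] if best else None)
    return labels

def bootstrap(examples, seed_rules, rounds=3, threshold=0.7):
    """Alternate between the spelling view and the context view."""
    seeds = {feat: (y, 1.0) for feat, y in seed_rules.items()}
    spelling_rules, context_rules = dict(seeds), {}
    for _ in range(rounds):
        labels = apply_rules(examples, spelling_rules, "spelling")
        context_rules = induce_rules(examples, labels, "context", threshold)
        labels = apply_rules(examples, context_rules, "context")
        learned = induce_rules(examples, labels, "spelling", threshold)
        spelling_rules = {**learned, **seeds}  # seed rules are never overwritten
    return spelling_rules, context_rules

# Toy unlabeled data (hypothetical): each example has two feature views.
examples = [
    {"spelling": ["full=New York"], "context": ["ctx=offices_in"]},
    {"spelling": ["full=Boston"], "context": ["ctx=offices_in"]},
    {"spelling": ["full=IBM Inc.", "contains=Inc."], "context": ["ctx=shares_of"]},
    {"spelling": ["full=Acme Corp"], "context": ["ctx=shares_of"]},
]
seed_rules = {"full=New York": "location", "contains=Inc.": "organization"}

spelling_rules, context_rules = bootstrap(examples, seed_rules)
print(spelling_rules["full=Boston"][0])
print(spelling_rules["full=Acme Corp"][0])
```

In this toy run neither "Boston" nor "Acme Corp" matches a seed rule; they receive labels only because they share a context with a seeded example. The low threshold of 0.7 is chosen so that rules survive on four examples; a realistic corpus would support a much stricter cutoff.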
