Transforming Wikipedia into Named Entity Training Data

TLDR

Statistical named entity recognisers require costly hand-labelled training data, so most existing corpora are small. The study uses Wikipedia to create a massive corpus of named entity annotated text. The authors transform Wikipedia links into named entity annotations by classifying each link's target article into an entity type such as person, organisation, or location. Compared with the MUC, CoNLL, and BBN corpora, the Wikipedia-derived corpus generally performs better than other cross-corpus train/test pairs.

Abstract

Statistical named entity recognisers require costly hand-labelled training data and, as a result, most existing corpora are small. We exploit Wikipedia to create a massive corpus of named entity annotated text. We transform Wikipedia’s links into named entity annotations by classifying the target articles into common entity types (e.g. person, organisation and location). Compared with the MUC, CoNLL and BBN corpora, the Wikipedia-derived corpus generally performs better than other cross-corpus train/test pairs.
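
The link-to-annotation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `ARTICLE_TYPES` lookup is a hypothetical stand-in for the article classifier, and the whitespace tokenisation, the simplified `[[Target|anchor]]` pattern, and the BIO tag scheme are all assumptions made for the example.

```python
import re

# Hypothetical article-title -> entity-type lookup. In the paper these types
# come from classifying each target article; that step is not reproduced here.
ARTICLE_TYPES = {
    "Barack Obama": "PER",
    "Microsoft": "ORG",
    "Hawaii": "LOC",
}

# Simplified pattern for [[Target]] and [[Target|anchor text]] wiki links.
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def annotate(wikitext):
    """Return (token, tag) pairs, tagging link anchors with the target's type."""
    tagged = []
    pos = 0
    for match in LINK_RE.finditer(wikitext):
        # Plain text before the link lies outside any entity.
        tagged += [(tok, "O") for tok in wikitext[pos:match.start()].split()]
        target = match.group(1).strip()
        anchor = match.group(2) or target
        label = ARTICLE_TYPES.get(target, "O")
        for i, tok in enumerate(anchor.split()):
            if label == "O":
                # Links to unclassified articles are left unannotated.
                tagged.append((tok, "O"))
            else:
                tagged.append((tok, ("B-" if i == 0 else "I-") + label))
        pos = match.end()
    # Remaining text after the last link.
    tagged += [(tok, "O") for tok in wikitext[pos:].split()]
    return tagged


if __name__ == "__main__":
    for token, tag in annotate("[[Barack Obama|Obama]] was born in [[Hawaii]] ."):
        print(token, tag)
```

In this sketch the anchor text, not the article title, receives the tag, so a link written as `[[Barack Obama|Obama]]` labels only the token "Obama".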
