The use of web-based statistics to validate information extraction

Abstract

The World Wide Web is a powerful and readily available text corpus that can be used effectively to validate the output of an information extraction system. We present experiments that explore how pointwise mutual information (PMI) from search engine hit counts can be used in an Assessor module that assigns a probability that an extracted fact or relationship is correct, thus boosting precision. We find that thresholding on PMI scores is more effective in creating features for the Assessor than using probability density models. Bootstrapping can be effective in finding both positive and negative seeds to train the Assessor, performing better than hand-tagging a sample of actual extractions.
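The idea of turning hit counts into thresholded PMI features and then into a probability can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the abstract: the PMI score is approximated as the ratio of co-occurrence hits to hits for the extracted instance alone, the Assessor is modeled as a naive Bayes combination of Boolean features, and the names `pmi_score`, `threshold_features`, and `assess`, along with all numeric values, are hypothetical.

```python
from typing import List


def pmi_score(hits_joint: int, hits_instance: int) -> float:
    """Approximate PMI as hits(instance + discriminator) / hits(instance).

    Assumption: this ratio form is one common hit-count approximation;
    the abstract does not specify the exact formula.
    """
    if hits_instance <= 0:
        return 0.0  # no evidence for the instance at all
    return hits_joint / hits_instance


def threshold_features(scores: List[float], thresholds: List[float]) -> List[bool]:
    """Turn each PMI score into a Boolean feature: True if it meets its threshold."""
    return [s >= t for s, t in zip(scores, thresholds)]


def assess(
    features: List[bool],
    p_true_given_pos: List[float],
    p_true_given_neg: List[float],
    prior_pos: float = 0.5,
) -> float:
    """Combine Boolean features with naive Bayes (an assumed classifier).

    p_true_given_pos[i] is the probability that feature i is True for a
    correct extraction; p_true_given_neg[i] is the same for an incorrect one.
    In practice both would be estimated from positive and negative seeds.
    """
    pos = prior_pos
    neg = 1.0 - prior_pos
    for f, pp, pn in zip(features, p_true_given_pos, p_true_given_neg):
        pos *= pp if f else 1.0 - pp
        neg *= pn if f else 1.0 - pn
    total = pos + neg
    return pos / total if total > 0 else prior_pos


# Hypothetical hit counts for one extraction and two discriminator phrases.
scores = [pmi_score(50, 1000), pmi_score(1, 1000)]
features = threshold_features(scores, thresholds=[0.01, 0.01])
probability = assess(features, [0.9, 0.8], [0.1, 0.3])
```

Thresholding reduces each discriminator to a single Boolean feature, so the classifier only needs two conditional probabilities per feature rather than a full density estimate over PMI values.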
