Web-scale information extraction in KnowItAll

TLDR

Manually querying search engines to accumulate factual information is tedious and error-prone. Search engines retrieve and rank relevant documents, but they do not extract facts, assess confidence, or fuse information from multiple sources. This paper introduces KnowItAll, a system that aims to automate the extraction of large collections of facts from the web in an autonomous, domain-independent, and scalable manner. KnowItAll assigns a probability to each extracted fact, which lets it trade off precision against recall. In preliminary experiments, one instance of KnowItAll running for four days on a single machine extracted 54,753 facts. The paper also analyzes KnowItAll's architecture and reports lessons learned for designing large-scale information extraction systems.

Abstract

Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KnowItAll, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner. The paper describes preliminary experiments in which an instance of KnowItAll, running for four days on a single machine, was able to automatically extract 54,753 facts. KnowItAll associates a probability with each fact, enabling it to trade off precision and recall. The paper analyzes KnowItAll's architecture and reports on lessons learned for the design of large-scale information extraction systems.
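
The precision/recall trade-off that per-fact probabilities make possible can be illustrated with a minimal sketch. This is not KnowItAll's code: the facts, their probabilities, the correctness labels, and the function name `precision_recall` are all hypothetical, chosen only to show how raising or lowering a probability threshold changes which facts are kept.

```python
from typing import List, Tuple

# Hypothetical extractions: (fact, assigned probability, is it actually correct?).
# All values are invented for illustration; they do not come from the paper.
Fact = Tuple[str, float, bool]

FACTS: List[Fact] = [
    ("City(Seattle)", 0.95, True),
    ("City(Paris)", 0.90, True),
    ("City(Tuesday)", 0.60, False),
    ("City(Osaka)", 0.55, True),
    ("City(Welcome)", 0.20, False),
]


def precision_recall(facts: List[Fact], threshold: float) -> Tuple[float, float]:
    """Keep facts whose probability is at least `threshold`, then score them."""
    kept = [f for f in facts if f[1] >= threshold]
    total_correct = sum(1 for f in facts if f[2])
    kept_correct = sum(1 for f in kept if f[2])
    # Convention assumed here: precision is 1.0 when nothing is kept.
    precision = kept_correct / len(kept) if kept else 1.0
    recall = kept_correct / total_correct if total_correct else 0.0
    return precision, recall


if __name__ == "__main__":
    for threshold in (0.8, 0.5, 0.1):
        p, r = precision_recall(FACTS, threshold)
        print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

With this toy data, a high threshold keeps only the most confident facts (higher precision, lower recall), while a low threshold keeps more facts at the cost of admitting incorrect ones.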
