Concepedia

Publication | Closed Access

Snowball

Citations: 1.2K

References: 14

Year: 2000

TLDR

Text documents often contain valuable structured data hidden in regular English sentences that can be exploited as relational tables for precise queries or data mining. The study explores a technique for extracting such tables from document collections using only a handful of user‑provided training examples. Snowball generates extraction patterns from these examples, iteratively evaluates and retains high‑quality patterns and tuples, and applies novel strategies to extract relational tables from plain‑text documents. The authors develop a scalable evaluation methodology and metrics, and demonstrate Snowball’s effectiveness in a thorough experimental evaluation on more than 300,000 newspaper documents.

Abstract

Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, that in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
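
The iterative loop described in the abstract (seed tuples yield patterns, patterns yield new tuples, and only reliable patterns survive to the next round) can be illustrated with a small Python sketch. This is a toy under strong assumptions, not the authors' implementation: the candidate triples, the names `CANDIDATES`, `pattern_confidence`, and `bootstrap`, and the 0.8 threshold are all hypothetical. A pattern here is simply the exact text between two entities, and only patterns are scored. The abstract says Snowball also evaluates the tuples themselves, which this sketch omits.

```python
# Each candidate is (organization, text between the entities, location).
# Assumption: a named-entity tagger has already produced these triples.
# The data below is invented for illustration only.
CANDIDATES = [
    ("Microsoft", "is headquartered in", "Redmond"),
    ("Exxon", "is headquartered in", "Irving"),
    ("Intel", "is headquartered in", "Santa Clara"),
    ("Exxon", ", based in", "Irving"),
    ("Boeing", ", based in", "Seattle"),
    ("Microsoft", "has an office in", "London"),
    ("Intel", "has an office in", "Santa Clara"),
    ("Apple", "has an office in", "Tokyo"),
]


def pattern_confidence(pattern, candidates, known):
    """Fraction of a pattern's matches that agree with already-known tuples.

    Matches for organizations that are not yet known are ignored: they
    count as neither positive nor negative evidence.
    """
    positive = negative = 0
    for org, middle, loc in candidates:
        if middle != pattern or org not in known:
            continue
        if known[org] == loc:
            positive += 1
        else:
            negative += 1
    total = positive + negative
    return positive / total if total else 0.0


def bootstrap(candidates, seeds, min_conf=0.8, max_iters=5):
    """Grow a table of organization -> location from a few seed tuples."""
    known = dict(seeds)
    for _ in range(max_iters):
        # 1. Generate patterns from contexts in which a known tuple occurs.
        patterns = {m for o, m, l in candidates if known.get(o) == l}
        # 2. Keep only patterns that score well against the known tuples.
        reliable = {
            p for p in patterns
            if pattern_confidence(p, candidates, known) >= min_conf
        }
        # 3. Extract new tuples using the reliable patterns only.
        new = {
            o: l for o, m, l in candidates
            if m in reliable and o not in known
        }
        if not new:
            break  # nothing new was found, so the table has converged
        known.update(new)
    return known


if __name__ == "__main__":
    print(bootstrap(CANDIDATES, {"Microsoft": "Redmond"}))
```

Starting from the single seed (Microsoft, Redmond), the sketch picks up Exxon, Intel, and Boeing over two rounds. The pattern "has an office in" is generated once Intel is known, but it contradicts the seed for Microsoft, scores 0.5, and is discarded, so the (Apple, Tokyo) candidate is never added to the table.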
