Wrapper induction for information extraction

Year: 1997
Citations: 1K
References: 0

Abstract

Many Internet information resources present relational data, such as telephone directories and product catalogs. Because these sites are formatted for people, mechanically extracting their content is difficult. Systems that use such resources typically rely on hand-coded wrappers: procedures that extract data from information resources. We introduce wrapper induction, a method for automatically constructing wrappers, and identify HLRT, a wrapper class that is efficiently learnable, yet expressive enough to handle 48% of a recently surveyed sample of Internet resources. We use PAC analysis to bound the problem's sample complexity, and show that the system degrades gracefully with imperfect labeling knowledge.

1 Introduction

The Internet contains many sources of relational data. For example, when queried with a name, email address services return ⟨name, email⟩ pairs. But because these sites are designed for people, the content is formatted for human browsing (e.g. an HTML page), rather than for use...
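
The abstract names the HLRT wrapper class without defining it. As an illustration only, the sketch below assumes the usual reading of HLRT as head-left-right-tail: a head delimiter marks where the data region begins, a tail delimiter marks where it ends, and each attribute is bracketed by a left and a right delimiter. The function name `exec_hlrt`, the sample page, and the delimiter strings are all hypothetical, and the code shows how such a wrapper could be applied to a page, not how one is induced.

```python
from typing import List, Tuple


def exec_hlrt(page: str, head: str, tail: str,
              delims: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Apply a head-left-right-tail style wrapper to a page (illustrative sketch).

    delims holds one (left, right) delimiter pair per attribute and is
    assumed to be non-empty.
    """
    # Skip everything up to and including the head delimiter.
    pos = page.find(head)
    if pos == -1:
        return []
    pos += len(head)

    rows: List[Tuple[str, ...]] = []
    first_left = delims[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further tuple starts, or when the tail comes first.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            break
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start == -1:
                return rows  # malformed tuple: give up on the rest
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows


# Hypothetical page: the <b> in the footer follows the tail, so it is ignored.
page = ("<html><title>Directory</title><body><p>"
        "<b>Ann</b> <i>ann@example.org</i><br>"
        "<b>Bob</b> <i>bob@example.org</i><br>"
        "<hr><b>End</b></body></html>")

rows = exec_hlrt(page, head="<p>", tail="<hr>",
                 delims=[("<b>", "</b>"), ("<i>", "</i>")])
print(rows)
```

In this sketch the head and tail delimiters keep markup outside the data region (here, the bold footer text) from being mistaken for a tuple. Wrapper induction, as described in the abstract, would aim to learn such delimiters from labeled example pages instead of having them hand-coded.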