Crawling the Hidden Web

TLDR

Current crawlers retrieve only publicly indexable Web pages, ignoring the large amount of high-quality content that is hidden behind search forms or requires authorization. We address the problem of designing a crawler capable of extracting content from this hidden Web. We present HiWE, a prototype crawler that implements a generic operational model and employs a Layout-based Information Extraction Technique (LITE) to automatically extract semantic information from search forms and response pages. We also report experiments conducted to test and validate the proposed techniques.

Abstract

Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of Web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high-quality content “hidden” behind search forms, in large searchable electronic databases. In this paper, we address the problem of designing a crawler capable of extracting content from this hidden Web. We introduce a generic operational model of a hidden Web crawler and describe how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford. We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. We also present results from experiments conducted to test and validate our techniques.
