Focused Crawling Using Context Graphs

TLDR

Exhaustive crawling cannot keep search engine indices current because of the web's growth and dynamism. Focused crawlers address this by targeting category-specific subsets of the web, but they struggle with credit assignment along crawl paths: pursuing short-term gains can cost them less obvious paths that ultimately lead to larger sets of valuable pages. The authors propose a focused crawling algorithm that models the context in which topically relevant pages occur. The context model captures typical link hierarchies and the content of documents that frequently co-occur with relevant pages, and the algorithm exploits the partial reverse-crawling capabilities of large search engines. The authors report significant gains in crawling efficiency over standard focused crawling.

Abstract

Maintaining currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size and dynamic content of the web. Focused crawlers aim to search only the subset of the web related to a specific category, and offer a potential solution to the currency problem. The major problem in focused crawling is performing appropriate credit assignment to different documents along a crawl path, such that short-term gains are not pursued at the expense of less-obvious crawl paths that ultimately yield larger sets of valuable pages. To address this problem we present a focused crawling algorithm that builds a model for the context within which topically relevant pages occur on the web. This context model can capture typical link hierarchies within which valuable pages occur, as well as model content on documents that frequently co-occur with relevant pages. Our algorithm further leverages the existing ability of large search engines to provide partial reverse crawling. Our algorithm shows significant performance improvements in crawling efficiency over standard focused crawling.
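
The abstract does not give implementation details, so the following Python sketch shows only one way the two ideas could fit together: a reverse crawl from known relevant pages builds layers of context, and the forward crawl then expands the page estimated to be closest to a relevant page. The names build_context_layers, crawl, and estimate_layer, the toy link tables, and the dictionary lookup that stands in for a learned content model are all assumptions made for illustration, not the authors' implementation.

```python
from collections import deque
import heapq


def build_context_layers(seeds, backlinks, max_depth):
    """Breadth-first reverse crawl: layer i holds pages i links away from a seed.

    `backlinks` maps a URL to the pages that link to it; here it stands in
    for a search engine's backlink query (an assumption of this sketch).
    """
    layer_of = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        depth = layer_of[url]
        if depth == max_depth:
            continue  # do not grow the context model past the outer layer
        for parent in backlinks.get(url, ()):
            if parent not in layer_of:
                layer_of[parent] = depth + 1
                queue.append(parent)
    return layer_of


def crawl(start, outlinks, estimate_layer, budget):
    """Best-first forward crawl: always expand the lowest estimated layer."""
    frontier = [(estimate_layer(start), start)]
    seen = {start}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for child in outlinks.get(url, ()):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (estimate_layer(child), child))
    return visited


# Hypothetical toy web: "target" is the relevant page.
backlinks = {"target": ["hub"], "hub": ["home"]}
outlinks = {"home": ["hub", "news"], "hub": ["target"], "news": ["sports"]}

layers = build_context_layers(["target"], backlinks, max_depth=2)


def estimate_layer(url):
    # Stand-in for a content-based model: pages outside the context
    # layers are ranked behind every known layer.
    return layers.get(url, 3)


visited = crawl("home", outlinks, estimate_layer, budget=3)
```

In this toy setup the crawler follows home, hub, target and never spends budget on the news branch, even though neither home nor hub is itself relevant. That is the credit-assignment behavior the abstract describes: a page is valued for where its links tend to lead, not only for its own content.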
