Querying the World Wide Web

TLDR

The Web is a large, heterogeneous, distributed collection of documents linked by hypertext, yet current search relies on index servers and cannot exploit the network’s structure or topology. The authors introduce WebSQL, a query language that integrates textual retrieval with structure‑ and topology‑based queries while leveraging multiple index servers transparently, and they propose a query‑locality cost theory. WebSQL’s semantics are defined by a calculus based on a novel virtual‑graph model of the document network, and the authors implemented a Java prototype.

Abstract

The World Wide Web is a large, heterogeneous, distributed collection of documents connected by hypertext links. The most common technology currently used for searching the Web depends on sending information retrieval requests to "index servers". A limitation of this approach is that such queries cannot exploit the structure and topology of the document network. The authors propose a query language, WebSQL, that takes advantage of multiple index servers without requiring users to know about them, and that integrates textual retrieval with structure- and topology-based queries. They give a formal semantics for WebSQL using a calculus based on a novel "virtual graph" model of a document network. They propose a new theory of query cost based on the idea of "query locality", that is, how much of the network must be visited to answer a particular query. Finally, they describe a prototype implementation of WebSQL written in Java.
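
The idea of query locality can be illustrated with a small sketch. The code below is not WebSQL and does not reproduce the authors' calculus or cost theory. It assumes a hypothetical in-memory "web" (`TOY_WEB`), a stand-in `fetch` function, and a hypothetical helper `reachable_matching` that follows links up to a fixed number of hops, tests each document's text for a keyword, and counts how many documents had to be fetched. That count serves as a rough proxy for how much of the network a query must visit.

```python
from collections import deque

# Hypothetical toy "web": URL -> (text, outgoing link URLs).
# This data is an assumption made for illustration only.
TOY_WEB = {
    "a": ("home page about databases", ["b", "c"]),
    "b": ("notes on query languages", ["d"]),
    "c": ("photo gallery", []),
    "d": ("query cost and locality", ["a"]),
    "e": ("unlinked page on query languages", []),
}


def fetch(url):
    """Stand-in for retrieving a document over the network."""
    return TOY_WEB[url]


def reachable_matching(start, keyword, max_depth):
    """Breadth-first search from `start`, following at most `max_depth` links.

    Returns (matching_urls, fetched_count), where fetched_count is the
    number of documents that had to be retrieved to answer the query.
    """
    seen = {start}
    queue = deque([(start, 0)])
    matches = []
    fetched = 0
    while queue:
        url, depth = queue.popleft()
        text, links = fetch(url)  # documents are discovered only when fetched
        fetched += 1
        if keyword in text:
            matches.append(url)
        if depth < max_depth:
            for target in links:
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return matches, fetched


if __name__ == "__main__":
    print(reachable_matching("a", "query", 1))
    print(reachable_matching("a", "query", 2))
```

In this sketch, widening the hop limit from 1 to 2 increases the number of documents fetched, and page "e" is never visited because no link leads to it. A purely index-based search would return "e" for the keyword but could not express the reachability condition, which is the kind of structural constraint the abstract describes.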