Hybrid index structures for location-based web search

TLDR

Location-based web search requires indexing both textual and two-dimensional geographic information, yet the set-oriented indexes of conventional text search engines do not support spatial data in Euclidean space. The study aims to devise efficient methods for representing the location attributes of web pages and for integrating textual and spatial indexes. It introduces a hybrid index that combines inverted files with R*-trees, evaluates three integration schemes, and implements a full search engine featuring geographic scope extraction, hybrid indexing, relevance ranking, and a user interface. Experiments on a large real-world web dataset show that the second and third hybrid schemes achieve the best query times, with the second slightly ahead of the third, and that R*-tree based indexes are more efficient than grid-based ones.

Abstract

There is growing commercial and research interest in location-based web search, i.e. finding web content whose topic is related to a particular place or region. In this type of search, location information must be indexed alongside text information. However, the index of a conventional text search engine is set-oriented, while location information is two-dimensional and lies in Euclidean space. This raises new research problems: how to efficiently represent the location attributes of web pages, and how to combine the two types of indexes. In this paper, we propose a hybrid index structure, which integrates inverted files and R*-trees, to handle both textual and location-aware queries. Three combining schemes are studied: (1) inverted file and R*-tree double index, (2) first inverted file then R*-tree, (3) first R*-tree then inverted file. To validate the performance of the proposed index structures, we design and implement a complete location-based web search engine consisting of four main parts: (1) an extractor, which detects the geographical scopes of web pages and represents them as multiple MBRs based on geographical coordinates; (2) an indexer, which builds hybrid index structures that integrate text and location information; (3) a ranker, which ranks results by geographical relevance as well as non-geographical relevance; (4) an interface, which lets users enter location-based search queries and obtain geographically and textually relevant results. Experiments on a large real-world web dataset show that both the second and the third structures are superior in query time, and that the second is slightly better than the third. Additionally, indexes based on R*-trees are shown to be more efficient than indexes based on grid structures.
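As a rough illustration of how the first two combining schemes differ, the sketch below builds both over the same toy documents. It is not the paper's implementation: the class and method names (`MBR`, `SpatialIndex`, `DoubleIndex`, `TextFirstHybridIndex`) are hypothetical, and the spatial component is a plain linear scan over rectangles standing in for an R*-tree, which would answer the same window query with a tree traversal.

```python
from collections import defaultdict
from typing import NamedTuple


class MBR(NamedTuple):
    """Minimum bounding rectangle in coordinate space."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other: "MBR") -> bool:
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y)


class SpatialIndex:
    """Assumption: a linear scan used as a stand-in for an R*-tree."""

    def __init__(self):
        self.entries = []  # list of (MBR, doc_id) pairs

    def insert(self, mbr, doc_id):
        self.entries.append((mbr, doc_id))

    def search(self, window):
        return {doc_id for mbr, doc_id in self.entries
                if mbr.intersects(window)}


class DoubleIndex:
    """Scheme (1): independent inverted file and spatial index."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> doc ids
        self.spatial = SpatialIndex()     # covers all documents

    def add(self, doc_id, terms, mbrs):
        for term in terms:
            self.postings[term].add(doc_id)
        for mbr in mbrs:
            self.spatial.insert(mbr, doc_id)

    def query(self, terms, window):
        if not terms:
            return set()
        text_hits = set.intersection(
            *(self.postings.get(t, set()) for t in terms))
        # Both indexes are queried separately, then results are merged.
        return text_hits & self.spatial.search(window)


class TextFirstHybridIndex:
    """Scheme (2): inverted file on top, one spatial index per term."""

    def __init__(self):
        self.by_term = defaultdict(SpatialIndex)

    def add(self, doc_id, terms, mbrs):
        for term in set(terms):
            for mbr in mbrs:
                self.by_term[term].insert(mbr, doc_id)

    def query(self, terms, window):
        result = None
        for term in terms:
            index = self.by_term.get(term)
            hits = index.search(window) if index else set()
            result = hits if result is None else result & hits
        return result or set()


# Toy documents: (doc id, terms, geographical scope as MBRs).
docs = [
    (1, ["pizza", "delivery"], [MBR(0, 0, 2, 2)]),
    (2, ["pizza"], [MBR(10, 10, 12, 12)]),
    (3, ["museum"], [MBR(1, 1, 3, 3)]),
]
double, text_first = DoubleIndex(), TextFirstHybridIndex()
for doc_id, terms, mbrs in docs:
    double.add(doc_id, terms, mbrs)
    text_first.add(doc_id, terms, mbrs)

window = MBR(1, 1, 5, 5)
print(double.query(["pizza"], window), text_first.query(["pizza"], window))
```

In this sketch the double index searches one spatial structure covering every document and then intersects with the text postings, whereas the text-first index only searches the smaller spatial structures attached to the query terms. Scheme (3) would invert that nesting, attaching an inverted file to each spatial node.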
