Publication | Closed Access
Efficient set intersection for inverted indexing
164
Citations
21
References
2010
Year
Mathematical ProgrammingEngineeringRandom Access TechniquesComputational ComplexitySemantic WebInverted IndexingText MiningInformation RetrievalData ScienceData MiningConjunctive Boolean QueriesData RetrievalCombinatorial OptimizationData ManagementText IndexingComputer ScienceConjunctive Query QQuery OptimizationData IndexingRelational QueriesOrder-sorted LogicSearch Engine IndexingIndexing Technique
Conjunctive Boolean queries in modern IR systems rely on intersecting ordered sets of document identifiers, and while index data is highly compressible, it is typically processed via random access, creating a tension between data representation and manipulation. The authors aim to investigate the trade‑offs between using uncompressed integer representations and compressed arrangements for set intersection. They examine intersection techniques that employ both uncompressed and compressed representations and propose a simple hybrid method that combines compact storage with faster intersection computations. The hybrid method achieves both more compact storage and faster conjunctive query processing than even uncompressed representations alone.
Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when Web-scale repositories are being searched. A conjunctive query q is equivalent to a | q |-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented, and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues to do with access patterns and data representation. Our purpose in this article is to explore these trade-offs, by investigating intersection techniques that make use of both uncompressed “integer” representations, as well as compressed arrangements. We also propose a simple hybrid method that provides both compact storage, and also faster intersection computations for conjunctive querying than is possible even with uncompressed representations.
| Year | Citations | |
|---|---|---|
Page 1
Page 1