Probabilistic top- <i>k</i> and ranking-aggregate queries

TLDR

Ranking and aggregation queries are widely used, but most existing techniques assume deterministic data; uncertain data introduces new challenges and a probability dimension that conventional methods cannot handle. The article introduces probabilistic formulations for top‑k and ranking‑aggregate queries in probabilistic databases. The authors develop a generic processing framework that merges traditional top‑k semantics with possible‑world semantics, using a state‑space model and efficient search algorithms to reduce tuple accesses and materialized search space. Experiments demonstrate orders‑of‑magnitude efficiency gains over naïve methods across various data distributions.

Abstract

Ranking and aggregation queries are widely used in data exploration, data analysis, and decision-making scenarios. While most of the currently proposed ranking and aggregation techniques focus on deterministic data, several emerging applications involve data that is unclean or uncertain. Ranking and aggregating uncertain (probabilistic) data raises new challenges in query semantics and processing, making conventional methods inapplicable. Furthermore, uncertainty imposes probability as a new ranking dimension that does not exist in the traditional settings. In this article we introduce new probabilistic formulations for top- k and ranking-aggregate queries in probabilistic databases. Our formulations are based on marriage of traditional top- k semantics with possible worlds semantics. In the light of these formulations, we construct a generic processing framework supporting both query types, and leveraging existing query processing and indexing capabilities in current RDBMSs. The framework encapsulates a state space model and efficient search algorithms to compute query answers. Our proposed techniques minimize the number of accessed tuples and the size of materialized search space to compute query answers. Our experimental study shows the efficiency of our techniques under different data distributions with orders of magnitude improvement over naïve methods.

References

Page 1

	Year	Citations

Page 1