Publication | Open Access
Towards estimation error guarantees for distinct values
233
Citations
23
References
2000
Year
Unknown Venue
Mathematical ProgrammingParameter EstimationDensity EstimationEngineeringData ScienceRobust StatisticUncertainty QuantificationDistinct ValuesStatistical FoundationLarge TablesKnowledge DiscoveryRare Event EstimationStatistical InferenceComputer ScienceHeuristic EstimatorsEstimation TheoryApproximation TheoryStatistics
We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicate their effectiveness both on real data sets and on synthetic data sets.
| Year | Citations | |
|---|---|---|
Page 1
Page 1