Publication | Open Access
Improving data quality: consistency and accuracy
Citations: 312
References: 24
Year: 2007
Data quality hinges on consistency and accuracy; inconsistencies and errors typically surface as violations of integrity constraints. The study aims to develop automated methods that repair a dirty database so that the result both satisfies a set of conditional functional dependencies (CFDs) and differs from the correct data by no more than a predefined bound. The authors use CFDs to model consistency and propose two algorithms: one computes a repair that satisfies the CFDs, and the other incrementally repairs a clean database after updates. They show that both repair problems are intractable, so the algorithms are heuristic, and verify experimentally that they are effective and efficient. An accompanying statistical method guarantees that the repairs are accurate above a predefined rate without excessive user interaction.
Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., to find a repair D' that satisfies the constraints and minimally differs from D. Equally important is to ensure that the automatically generated repair D' is accurate, or makes sense, i.e., that D' differs from the correct data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy. To specify the consistency of the data, we employ a class of conditional functional dependencies (CFDs) proposed in [6], which capture inconsistencies and errors beyond what traditional functional dependencies can catch. To improve the consistency of the data, we propose two algorithms: one for automatically computing a repair D' that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database. We show that both problems are intractable. Although our algorithms are necessarily heuristic, we experimentally verify that the methods are effective and efficient. Moreover, we develop a statistical method that guarantees that the repairs found by the algorithms are accurate above a predefined rate without incurring excessive user interaction.
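To make the notion of a CFD violation concrete, the following is a minimal Python sketch of violation detection only, not the repair algorithms the abstract describes. It assumes the usual reading of a CFD as a functional dependency plus a pattern of constants and wildcards that restricts where the dependency applies. The function names, the `WILDCARD` marker, and the sample customer rows are all hypothetical and chosen purely for illustration.

```python
from itertools import combinations

WILDCARD = "_"  # hypothetical marker meaning "any value"


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def cfd_violations(rows, lhs, rhs):
    """Return sorted indices of rows involved in a violation of lhs -> rhs.

    lhs and rhs map attribute names to a constant or WILDCARD.
    """
    bad = set()
    # Only rows matching the left-hand pattern are in the CFD's scope.
    scoped = [(i, r) for i, r in enumerate(rows) if matches(r, lhs)]
    # Single-row violation: in scope, but a right-hand constant is not met.
    for i, r in scoped:
        if not matches(r, rhs):
            bad.add(i)
    # Pair violation: two in-scope rows agree on lhs but differ on rhs.
    for (i, r), (j, s) in combinations(scoped, 2):
        if all(r[a] == s[a] for a in lhs) and any(r[a] != s[a] for a in rhs):
            bad.update((i, j))
    return sorted(bad)


# Illustrative data: country code, zip code, street (all made up).
ROWS = [
    {"CC": "44", "ZIP": "EH4 1DT", "STR": "Mayfield"},
    {"CC": "44", "ZIP": "EH4 1DT", "STR": "Crichton"},
    {"CC": "01", "ZIP": "07974", "STR": "Mtn Ave"},
    {"CC": "01", "ZIP": "07974", "STR": "Main St"},
]

# Conditional rule: only when CC = 44 does ZIP determine STR.
CFD_UK = ({"CC": "44", "ZIP": WILDCARD}, {"STR": WILDCARD})
# Unconditional rule for comparison: ZIP determines STR for every CC.
FD_ALL = ({"CC": WILDCARD, "ZIP": WILDCARD}, {"STR": WILDCARD})

print(cfd_violations(ROWS, *CFD_UK))
print(cfd_violations(ROWS, *FD_ALL))
```

In this sketch the conditional rule flags only the two rows with CC = 44, while the unconditional dependency also flags the CC = 01 rows, where the rule was never meant to hold. A right-hand pattern with a constant lets a single row violate a CFD on its own, which a traditional functional dependency cannot express.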