Publication | Closed Access
Data Cleaning
Citations: 452
References: 73
Year: 2016
Unknown Venue
Engineering, Machine Learning, Data Preparation, Data Cleaning, Dirty Data, Data Science, Data Mining, Management, Data Integration, Data Pre-processing, Data Management, Statistics, Data Modeling, Outlier Detection, Knowledge Discovery, Computer Science, Data Cleansing, Data Treatment, Big Data
Dirty data remains a persistent challenge in analytics, driving growing industry and academic interest in new abstractions, interfaces, scalable approaches, and statistical techniques for detection and repair. The authors aim to clarify recent advances by presenting a taxonomy of the data cleaning literature that emphasizes qualitative techniques, which use constraints, rules, or patterns to detect errors, and by illustrating state-of-the-art methods and their limitations. They also discuss recent work that frames qualitative methods within a statistical estimation framework, covering the use of machine learning to improve the efficiency and accuracy of cleaning and the effects of cleaning on statistical analysis.
Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia in data cleaning problems, including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We describe the state-of-the-art techniques and highlight their limitations with a series of illustrative examples. While such approaches are traditionally distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts them into a statistical estimation framework, including using machine learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.
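To make the qualitative/quantitative distinction concrete, here is a minimal Python sketch that is not taken from the publication. It assumes a toy table, a hypothetical functional dependency `zip -> city` as the qualitative rule, and a median-absolute-deviation test as a stand-in for quantitative outlier detection. The function names, sample rows, and the 3.5 threshold are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import median


def fd_violations(rows, lhs, rhs):
    """Qualitative check: find lhs values that map to more than one rhs value.

    Such groups violate the (assumed) functional dependency lhs -> rhs.
    """
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


def mad_outliers(values, threshold=3.5):
    """Quantitative check: flag values with a large modified z-score.

    Uses the median absolute deviation (MAD); the threshold is a common
    rule of thumb, chosen here only for illustration.
    """
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, nothing to flag
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical sample data, invented for this sketch.
rows = [
    {"zip": "10001", "city": "New York", "salary": 52000},
    {"zip": "10001", "city": "New York", "salary": 55000},
    {"zip": "10001", "city": "Newark", "salary": 51000},
    {"zip": "60601", "city": "Chicago", "salary": 54000},
    {"zip": "60601", "city": "Chicago", "salary": 540000},
]

rule_errors = fd_violations(rows, "zip", "city")
numeric_errors = mad_outliers(row["salary"] for row in rows)
print(rule_errors)
print(numeric_errors)
```

The rule-based check only detects that the rows for one zip code disagree; it does not say which city is correct, which is one reason repair is treated as a separate problem from detection.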