Publication | Open Access
BigDansing
Citations: 150
References: 33
Year: 2015
Unknown Venue
Cluster Computing, Engineering, Data Science, Cloud Computing, Data Cleansing Approaches, Data Integration, Big Datasets, Computer Science, Data Cleansing, Massive Data Processing, Parallel Computing, Costly Computations, Map-reduce, Data Management, Data-intensive Computing, Big Data, High-performance Data Analytics
Data cleansing methods typically focus on error detection and correction while neglecting scalability to large datasets, creating a bottleneck due to expensive operations such as pair enumeration, inequality joins, and user-defined functions. This paper introduces BigDansing, a Big Data Cleansing system designed to address efficiency, scalability, and usability challenges. BigDansing operates atop common data-processing platforms, offers a user-friendly interface for declarative and procedural rule specification, and transforms these rules into distributed computations with optimizations such as shared scans and specialized join operators. Experiments on synthetic and real data show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude while maintaining repair quality.
Data cleansing approaches have usually focused on detecting and fixing errors, with little attention to scaling to big datasets. This presents a serious impediment, since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, without needing to be aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
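To make the cost of enumerating pairs of tuples concrete, the following is a minimal single-machine sketch and not BigDansing's interface or implementation. It assumes a hypothetical data quality rule (rows with the same `zipcode` must have the same `city`) and toy records; the field names, the function names `violations_naive` and `violations_blocked`, and the grouping strategy are all illustrative assumptions. The first function compares every pair of rows, while the second groups rows by the rule's key first, so that comparisons happen only within each group.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; the field names are illustrative assumptions.
rows = [
    {"id": 1, "zipcode": "10001", "city": "New York"},
    {"id": 2, "zipcode": "10001", "city": "NYC"},
    {"id": 3, "zipcode": "60601", "city": "Chicago"},
    {"id": 4, "zipcode": "60601", "city": "Chicago"},
]


def violations_naive(rows, lhs, rhs):
    """Compare every pair of rows: the number of comparisons grows quadratically."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(rows, 2)
        if a[lhs] == b[lhs] and a[rhs] != b[rhs]
    ]


def violations_blocked(rows, lhs, rhs):
    """Group rows by the lhs value first, then compare only within each group."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[lhs]].append(row)

    found = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if a[rhs] != b[rhs]:
                found.append((a["id"], b["id"]))
    return found


if __name__ == "__main__":
    print(violations_naive(rows, "zipcode", "city"))
    print(violations_blocked(rows, "zipcode", "city"))
```

Both functions report the same violating pairs. The grouped version skips cross-group comparisons, and because each group can be checked independently, grouping of this kind is also a natural unit for distributing the work.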