Data Wrangling for Big Data: Challenges and Opportunities

TLDR

Data wrangling (identifying, extracting, cleaning, and integrating data) produces datasets fit for analysis, yet existing ETL tools demand manual effort that becomes prohibitively costly when confronting big data’s volume, velocity, variety, and veracity. This paper argues that achieving cost-effective, highly automated data wrangling requires fundamental research advances in data extraction, integration, and cleaning, and in how these steps are orchestrated. The authors advocate context-aware, adaptive, pay-as-you-go solutions that automatically tune the wrangling process to an application’s specific requirements and resource constraints.

Abstract

Data wrangling is the process by which the data required by an application is identified, extracted, cleaned and integrated, to yield a data set that is suitable for exploration and analysis. Although there are widely used Extract, Transform and Load (ETL) techniques and platforms, they often require manual work from technical and domain experts at different stages of the process. When confronted with the 4 V’s of big data (volume, velocity, variety and veracity), manual intervention may make ETL prohibitively expensive. This paper argues that providing cost-effective, highly automated approaches to data wrangling involves significant research challenges, requiring fundamental changes to established areas such as data extraction, integration and cleaning, and to the ways in which these areas are brought together. Specifically, the paper discusses the importance of comprehensive support for context awareness within data wrangling, and the need for adaptive, pay-as-you-go solutions that automatically tune the wrangling process to the requirements and resources of the specific application.
