Data Wrangling: The Challenging Journey from the Wild to the Lake

TLDR

The growing data deluge has led to the concept of a data lake: a centralized repository of raw or minimally curated data that promises fast access to diverse information for analytical use, without the development costs typically associated with structured data warehouses. The paper describes the challenges of creating, filling, maintaining, and governing a data lake, a set of processes that the authors collectively define as data wrangling. From their own experience and that of their customers, the authors report that raw data is logistically difficult to obtain, challenging to interpret and describe, and tedious to maintain, and that these difficulties multiply as the number of sources grows, undermining the promised benefits of a data lake. They propose that what is really needed is a curated data lake, whose contents have undergone a curation process that makes ad-hoc data access possible for users beyond the enterprise IT staff.

Abstract

Much has been written about the explosion of data, also known as the “data deluge”. Similarly, much of today's research and decision making is based on the de facto acceptance that knowledge and insight can be gained from analyzing and contextualizing the vast (and growing) amount of “open” or “raw” data. The concept that the large number of data sources available today facilitates analyses on combinations of heterogeneous information that would not be achievable via “siloed” data maintained in warehouses is very powerful. The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities. The often unstated premise of a data lake is that it relieves users from dealing with data acquisition and maintenance issues, and guarantees fast access to local, accurate and updated data without incurring development costs (in terms of time and money) typically associated with structured data warehouses. However appealing this premise, practically speaking, it is our experience, and that of our customers, that “raw” data is logistically difficult to obtain, quite challenging to interpret and describe, and tedious to maintain. Furthermore, these challenges multiply as the number of sources grows, thus increasing the need to thoroughly describe and curate the data in order to make it consumable. In this paper, we present and describe some of the challenges inherent in creating, filling, maintaining, and governing a data lake, a set of processes that collectively define the actions of data wrangling, and we propose that what is really needed is a curated data lake, where the lake contents have undergone a curation process that enables its use and delivers the promise of ad-hoc data accessibility to users beyond the enterprise IT staff.
