Concepedia

Abstract

In the past few years, data lakes have emerged as a trending topic in big data technologies. Although the literature presents different points of view on their functionality, a data lake serves mainly to store a variety of data in a big data context. In this paper, we aim to identify and analyze data lake definitions and possible architectures. Our methodology combined a systematic literature mapping based on PRISMA, software engineering best practices for performing reviews, and the Kappa method to assess the quality of the results. We searched eight electronic databases to cover a wide variety of publishers in Computer Science. We first identified 662 papers matching our search criteria; after filtering, we selected 87 papers for review. We found that the term "data lake" was first defined by James Dixon in 2010 and that it is often associated with raw data repositories. From the identified definitions, we propose a new one to better state what data lakes refer to and to improve how the community uses them. Moreover, we found that Hadoop and its ecosystem compose the most widely used toolset for creating data lakes, which makes them the mainstream choice for data lake architectures given the technologies available today.

