Publication | Closed Access
RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems
Citations: 258
References: 22
Year: 2011
Unknown Venue
Cluster Computing, Engineering, Big Data Analytics, Mapreduce-based Warehouse Systems, Storage Management, Storage Structure, Map-reduce, Big Data Processing, Data Placement Structure, Data Science, Data-intensive Platform, Data Integration, Parallel Computing, Data Management, Computer Science, Facebook Production Systems, Social Network Sites, Cloud Computing, Parallel Programming, Distributed Data Store, Massive Data Processing, Big Data
MapReduce-based data warehouses, such as those used by Facebook, rely on efficient data placement structures to support rapid analytics of user behavior trends. This work introduces RCFile, a new data placement structure for Hadoop designed to meet the demands of large-scale analytics. RCFile is built to satisfy four key requirements (fast loading, fast query processing, efficient storage, and adaptability to dynamic workloads) and is compared against row-, column-, and hybrid-stores in MapReduce environments. Experimental results show that RCFile meets all four requirements. It became the default storage format in Facebook’s data warehouse and has been adopted by Hive and Pig.
MapReduce-based data warehouse systems play an important role in supporting big data analytics, helping typical Web service providers and social network sites (e.g., Facebook) quickly understand the dynamics of user behavior trends and user needs. In such a system, the data placement structure is a critical factor that can fundamentally affect warehouse performance. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures from conventional databases, namely row-stores, column-stores, and hybrid-stores, in the context of large data analysis using MapReduce. We show that they are not well suited to big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. Through extensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen as the default option in the Facebook data warehouse system. It has also been adopted by Hive and Pig, the two most widely used data analysis systems, developed at Facebook and Yahoo!, respectively.
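The abstract names the structure but does not describe its layout. As a rough illustration of what a "record columnar" placement can look like, the sketch below splits a table horizontally into row groups and then stores each group column by column, so that a query reading only a few columns can skip the rest. This is a minimal in-memory sketch, not the Hadoop implementation: the function names `to_row_groups` and `read_columns` are hypothetical, and sizing groups by row count is a simplifying assumption.

```python
from typing import Any, List, Sequence, Tuple

Row = Sequence[Any]
RowGroup = List[List[Any]]  # one inner list per column


def to_row_groups(rows: List[Row], group_size: int) -> List[RowGroup]:
    """Split rows into groups of `group_size`, then lay out each group by column.

    `group_size` counted in rows is an assumption made for this sketch.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    groups: List[RowGroup] = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Transpose the chunk: values of one column become contiguous.
        groups.append([list(col) for col in zip(*chunk)])
    return groups


def read_columns(groups: List[RowGroup], wanted: List[int]) -> List[Tuple[Any, ...]]:
    """Rebuild rows from only the requested column indexes."""
    out: List[Tuple[Any, ...]] = []
    for columns in groups:
        selected = [columns[i] for i in wanted]  # other columns are never touched
        out.extend(zip(*selected))
    return out


# Hypothetical sample table: (user_id, name, age)
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
]
groups = to_row_groups(rows, group_size=2)
projection = read_columns(groups, wanted=[0, 2])
```

Because every row lives entirely inside one group, rebuilding a record never needs data from another group, while the column-wise layout within a group lets a projection such as `wanted=[0, 2]` ignore the `name` column altogether.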