Publication | Closed Access
Understanding insights into the basic structure and essential issues of table placement methods in clusters
Citations: 23 | References: 9 | Year: 2013
Keywords: Cluster Computing, Engineering, Big Data Analytics, Database Scalability, Map-reduce, Big Data Processing, Cluster Technology, Data Science, Essential Issues, Data Integration, Parallel Computing, Data Management, High-performance Data Analytics, Basic Structure, Very Large Database, Computer Engineering, Computer Science, Table Placement Method, Database Technology, Data-intensive Computing, Table Partitioning, Cluster Development, Table Placement Methods, Parallel Programming, Massive Data Processing, Big Data
A table placement method is a critical component of big data analytics on distributed systems. It determines how the data values in a two-dimensional table are organized and stored in the underlying cluster. Several table placement methods have been proposed and implemented for Hadoop computing environments. However, no comprehensive and systematic study has been done to understand, compare, and evaluate them. It is therefore highly desirable to gain insight into the basic structure and essential issues of table placement methods in the context of big data processing infrastructures. In this paper, we present such a study. The basic structure of a table placement method consists of three core operations: row reordering, table partitioning, and data packing. All existing placement methods are formed from these core operations, with variations determined by three key factors: (1) the size of a horizontal logical subset of a table (that is, the size of a row group), (2) the function that maps columns to column groups, and (3) the function that packs the columns or column groups of a row group into physical blocks. We have designed and implemented a benchmarking tool to show how variations in each factor affect the I/O performance of reading a table stored by a given table placement method. Based on our results, we suggest actions to optimize table reading performance. Results from large-scale experiments confirm that our findings hold for production workloads. Finally, we present ORC File as a case study to show the effectiveness of our findings and suggested actions.
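The three core operations and three key factors named in the abstract can be illustrated with a minimal in-memory sketch. This is not the paper's benchmarking tool or any Hadoop file format: the function `place_table` and its parameters `row_group_size`, `column_groups`, `groups_per_block`, and `sort_key` are hypothetical names chosen for illustration, and plain Python lists stand in for physical blocks.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Row = Tuple  # one table row, e.g. (id, name, amount)


def place_table(
    table: Sequence[Row],
    row_group_size: int,                  # factor 1: rows per row group
    column_groups: List[List[int]],       # factor 2: column indices per column group
    groups_per_block: int,                # factor 3: column groups packed per block
    sort_key: Optional[Callable] = None,  # optional row reordering
) -> list:
    """Illustrative only: returns a list of 'blocks' (lists of column groups)."""
    # Core operation 1: row reordering (skipped when no sort key is given).
    rows = sorted(table, key=sort_key) if sort_key is not None else list(table)

    blocks = []
    # Core operation 2a: horizontal partitioning into row groups.
    for start in range(0, len(rows), row_group_size):
        row_group = rows[start:start + row_group_size]

        # Core operation 2b: vertical partitioning into column groups.
        col_groups = [
            [tuple(row[c] for c in group) for row in row_group]
            for group in column_groups
        ]

        # Core operation 3: data packing of column groups into blocks.
        for i in range(0, len(col_groups), groups_per_block):
            blocks.append(col_groups[i:i + groups_per_block])
    return blocks


# Hypothetical example table with columns (id, name, amount).
table = [(3, "c", 30.0), (1, "a", 10.0), (2, "b", 20.0), (4, "d", 40.0)]

# Two rows per row group; column 0 alone, columns 1 and 2 together;
# one column group per block.
blocks = place_table(
    table,
    row_group_size=2,
    column_groups=[[0], [1, 2]],
    groups_per_block=1,
    sort_key=lambda r: r[0],
)
```

With these settings, a query that reads only column 0 would touch two of the four blocks. Setting `groups_per_block=2` instead packs each row group into a single block, so the same query would have to read every block. This is the kind of trade-off the three factors control, shown here only schematically.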