Concepedia

Publication | Closed Access

Optimizing data shuffling in data-parallel computation by understanding user-defined functions

Citations: 60

References: 33

Year: 2012

Abstract

Map/Reduce-style data-parallel computation is characterized by the extensive use of user-defined functions for data processing and relies on data-shuffling stages to prepare data partitions for parallel computation. Instead of treating user-defined functions as “black boxes”, we propose to analyze those functions to turn them into “gray boxes” that expose opportunities to optimize data shuffling. We identify useful functional properties for user-defined functions, and propose SUDO, an optimization framework that reasons about data-partition properties, functional properties, and data shuffling. We have assessed this optimization opportunity on over 10,000 data-parallel programs used in production SCOPE clusters, and designed a framework that is incorporated into the production system. Experiments with real SCOPE programs on real production data have shown that this optimization can save up to 47% in terms of disk and network I/O for shuffling, and up to 48% in terms of cross-pod network traffic.
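The "gray box" idea can be illustrated with a small sketch. The abstract does not name the functional properties it relies on, so the example assumes one plausible property: the user-defined function is strictly increasing in the partition key. Under that assumption, range-partitioned input stays range-partitioned after the function is applied locally, so a re-shuffle before the next stage could be skipped. All names here (`range_partition`, `is_range_partitioned`, the two sample functions) are hypothetical and are not taken from SUDO or SCOPE.

```python
from bisect import bisect_right


def range_partition(rows, boundaries, key):
    """Place each row in a partition by comparing key(row) to sorted boundaries."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, key(row))].append(row)
    return parts


def is_range_partitioned(parts, key):
    """True if every key in a partition is strictly below every key in later ones."""
    prev_max = None
    for part in parts:
        if not part:
            continue
        keys = [key(row) for row in part]
        if prev_max is not None and min(keys) <= prev_max:
            return False
        prev_max = max(keys)
    return True


def identity(x):
    return x


def strictly_increasing_udf(x):
    # Assumed "gray box": analysis could prove this preserves key order.
    return 2 * x + 1


def opaque_udf(x):
    # "Black box": scrambles key order, so nothing can be assumed.
    return (x * 7) % 10


# Input that has already been shuffled into four key ranges.
parts = range_partition(list(range(20)), boundaries=[5, 10, 15], key=identity)

# Apply each function inside its partition, with no data movement.
mapped_increasing = [[strictly_increasing_udf(r) for r in p] for p in parts]
mapped_opaque = [[opaque_udf(r) for r in p] for p in parts]

# Only the order-preserving function keeps the range-partition property.
reshuffle_needed_increasing = not is_range_partitioned(mapped_increasing, identity)
reshuffle_needed_opaque = not is_range_partitioned(mapped_opaque, identity)
```

In this toy setting the optimizer-style check reports that the output of the strictly increasing function can feed a downstream range-partitioned stage directly, while the output of the opaque function must be shuffled again.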
