Camdoop: exploiting in-network aggregation for big data applications

Abstract

Large companies like Facebook, Google, and Microsoft, as well as a number of small and medium enterprises, process massive amounts of data daily in batch jobs and in real-time applications. This generates high network traffic, which is hard to support using traditional, oversubscribed network infrastructures. To address this issue, several novel network topologies have been proposed, aiming at increasing the bandwidth available in enterprise clusters. We observe that in many of the commonly used workloads, data is aggregated during the process and the output size is a fraction of the input size. This motivated us to explore a different point in the design space. Instead of increasing the bandwidth, we focus on decreasing the traffic by pushing aggregation from the edge into the network. We built Camdoop, a MapReduce-like system running on CamCube, a cluster design that uses a direct-connect network topology with servers directly linked to other servers. Camdoop exploits the property that CamCube servers forward traffic to perform in-network aggregation of data during the shuffle phase. Camdoop supports the same functions used in MapReduce and is compatible with existing MapReduce applications. We demonstrate that, in common cases, Camdoop significantly reduces the network traffic and provides a large performance increase both over a version of Camdoop running over a switch and over two production systems, Hadoop and Dryad/DryadLINQ.
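
The traffic argument above can be illustrated with a small sketch. It is not Camdoop's implementation: the seven-server tree, the function names, and the unit of traffic (one key-value pair crossing one link) are all assumptions made for illustration, and the reduce function is assumed to be commutative and associative (here, summing word counts). The sketch compares shipping every server's raw pairs to a single root against merging partial results at each forwarding server.

```python
from collections import Counter

# Hypothetical 7-server aggregation tree rooted at server 0 (child -> parent).
# This layout is an assumption for illustration, not the CamCube topology.
PARENT = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}


def depth(node):
    """Number of links between a server and the root."""
    hops = 0
    while node in PARENT:
        node = PARENT[node]
        hops += 1
    return hops


def edge_aggregation(local):
    """Each server forwards its raw pairs to the root; only the root reduces.

    Traffic counts every pair once per link it crosses.
    """
    traffic = sum(len(pairs) * depth(node) for node, pairs in local.items())
    result = Counter()
    for pairs in local.values():
        result.update(pairs)  # Counter.update adds counts per key
    return result, traffic


def in_network_aggregation(local):
    """Each server merges its children's partial results before forwarding."""
    merged = {node: Counter(pairs) for node, pairs in local.items()}
    traffic = 0
    # Visit the deepest servers first so a parent forwards only after
    # all of its children have been merged into it.
    for node in sorted(PARENT, key=depth, reverse=True):
        traffic += len(merged[node])  # one merged pair per distinct key
        merged[PARENT[node]].update(merged[node])
    return merged[0], traffic


if __name__ == "__main__":
    # Every server holds counts for the same three keys (made-up data).
    local = {n: Counter({"a": 1, "b": 1, "c": 1}) for n in range(7)}
    print(edge_aggregation(local))
    print(in_network_aggregation(local))
```

Both functions return the same final counts; the difference is only in how many pairs cross links. The saving in this toy setup depends on how much the keys overlap between servers, which mirrors the observation that the output size is often a fraction of the input size.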
