Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf

TLDR

A central challenge in big data is processing vast amounts of information in a reasonable time. This work compares two distributed computing frameworks on commodity clusters: MPI/OpenMP on Beowulf and Apache Spark on Hadoop. Using Google Cloud Platform, the authors created virtual machine clusters, ran both frameworks, and evaluated two supervised learning algorithms, KNN and Pegasos SVM, on a particle physics data set. MPI/OpenMP outperforms Spark by more than an order of magnitude in processing speed and gives more consistent performance, while Spark offers better data management infrastructure and support for handling node failure and data replication.

Abstract

One of the biggest challenges of the current big data landscape is our inability to process vast amounts of information in a reasonable time. In this work, we explore and compare two distributed computing frameworks implemented on commodity cluster architectures: MPI/OpenMP on Beowulf, which is high-performance oriented and exploits multi-machine/multi-core infrastructures, and Apache Spark on Hadoop, which targets iterative algorithms through in-memory computing. We use the Google Cloud Platform service to create virtual machine clusters, run the frameworks, and evaluate two supervised machine learning algorithms: KNN and Pegasos SVM. Results obtained from experiments with a particle physics data set show that MPI/OpenMP outperforms Spark by more than one order of magnitude in terms of processing speed and provides more consistent performance. However, Spark shows better data management infrastructure and the possibility of dealing with other aspects such as node failure and data replication.
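
For readers unfamiliar with Pegasos SVM, the following is a minimal single-machine sketch of its stochastic sub-gradient update for a linear SVM. It is not the authors' MPI/OpenMP or Spark implementation: the function names, the regularization value, the iteration count, and the toy data are all assumptions chosen for illustration, and the optional projection step of the algorithm is omitted.

```python
import random


def pegasos_train(samples, labels, lam=0.01, num_iters=2000, seed=0):
    """Train a linear SVM (no bias term) with the Pegasos update rule.

    samples: list of feature vectors (lists of floats)
    labels:  list of +1 / -1 class labels
    lam:     regularization strength (assumed value, not from the paper)
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    for t in range(1, num_iters + 1):
        # Pick one training example uniformly at random.
        i = rng.randrange(len(samples))
        x, y = samples[i], labels[i]
        eta = 1.0 / (lam * t)          # step size decays as 1 / (lambda * t)
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        shrink = 1.0 - eta * lam       # shrinkage from the regularization term
        if margin < 1.0:
            # Hinge loss is active: shrink w and step toward the example.
            w = [shrink * wj + eta * y * xj for wj, xj in zip(w, x)]
        else:
            # Example is already classified with enough margin: shrink only.
            w = [shrink * wj for wj in w]
    return w


def predict(w, x):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Hypothetical, linearly separable toy data (not the particle physics set).
toy_x = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
toy_y = [1, 1, -1, -1]

weights = pegasos_train(toy_x, toy_y)
predictions = [predict(weights, x) for x in toy_x]
```

Each iteration touches a single example, so the loop is inherently sequential. A distributed version has to decide how to split and combine this work across nodes, which is one place where the two frameworks compared here could differ in approach.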
