Petuum: A New Platform for Distributed Machine Learning on Big Data

TLDR

Industrial‑scale machine learning requires efficient application of advanced algorithms to massive models and data, yet existing parallelization strategies vary widely and lack a universal platform. Petuum is a general‑purpose framework that tackles data‑ and model‑parallel challenges in large‑scale ML by exploiting optimization‑centric, error‑tolerant, iterative‑convergent algorithmic solutions. Its design incorporates bounded‑error network synchronization and dynamic scheduling driven by the structure of ML programs. Experiments show that Petuum enables ML programs to run faster and scale to larger models than existing implementations, even on modest compute clusters.

Abstract

What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework, Petuum, that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs versus well-known implementations of modern ML algorithms, showing that Petuum allows ML programs to run in much less time and at considerably larger model sizes, even on modestly-sized compute clusters.
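
The abstract names bounded-error network synchronization without describing a protocol. As a rough illustration only, the idea of letting workers drift apart by a bounded number of iterations can be sketched as below. This is not Petuum's API or implementation: the class `BoundedStalenessClock`, its methods, and the `staleness` parameter are hypothetical names chosen for this sketch, and the rule shown is a generic bounded-staleness check assumed for illustration.

```python
class BoundedStalenessClock:
    """Hypothetical helper: per-worker iteration counters with a staleness bound."""

    def __init__(self, num_workers, staleness):
        # clocks[w] counts the iterations worker w has completed so far.
        self.clocks = [0] * num_workers
        # Assumed bound: how many completed iterations a worker may be
        # ahead of the slowest worker and still start another iteration.
        self.staleness = staleness

    def can_advance(self, worker):
        # Compare this worker's progress with the slowest worker's progress.
        lead = self.clocks[worker] - min(self.clocks)
        return lead <= self.staleness

    def try_advance(self, worker):
        # Run one more iteration if allowed; otherwise the worker must wait.
        if not self.can_advance(worker):
            return False
        self.clocks[worker] += 1
        return True


if __name__ == "__main__":
    clock = BoundedStalenessClock(num_workers=2, staleness=1)
    print(clock.try_advance(0), clock.try_advance(0), clock.try_advance(0))
    print(clock.clocks)
```

With `staleness=0` the check reduces to a barrier after every iteration, which matches the bulk-synchronous pattern the abstract contrasts against; larger values trade freshness of shared parameters for less waiting, which iterative-convergent, error-tolerant programs may be able to absorb.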
