Publication | Closed Access
Gaia: geo-distributed machine learning approaching LAN speeds
Citations: 273 · References: 65 · Year: 2017
Venue: NSDI '17 (14th USENIX Symposium on Networked Systems Design and Implementation)
Engineering · Machine Learning · Distributed Algorithms · Information Processing · LAN Speeds · Distributed AI System · Distributed Data Analytics · ML Algorithm · Data Science · Parallel Computing · Distributed Model · Data Management · Data Center · Data Privacy · Distributed Systems · Computer Science · Distributed Learning · Decentralized Machine Learning · Federated Learning · Parallel Programming · Global Communication · Big Data
Machine learning relies on large-scale, globally generated data, but transferring this data to a central data center is infeasible due to bandwidth limits and privacy constraints. The authors aim to build a geo-distributed ML system that efficiently uses limited WAN bandwidth, preserves algorithm accuracy, and supports diverse ML workloads without code changes. Gaia achieves this by separating intra- and inter-data-center communication and employing an Approximate Synchronous Parallel model that dynamically suppresses negligible cross-DC traffic while maintaining correctness. Experiments show that WAN communication can slow ML by up to 53.7×, but Gaia delivers 1.8–53.5× speedups over existing systems and achieves 94–140% of LAN-level performance.
Machine learning (ML) is widely used to derive useful information from large-scale data (such as user activities, pictures, and videos) generated at increasingly rapid rates, all over the world. Unfortunately, it is infeasible to move all this globally-generated data to a centralized data center before running an ML algorithm over it; moving large amounts of raw data over wide-area networks (WANs) can be extremely slow, and is also subject to the constraints of privacy and data sovereignty laws. This motivates the need for a geo-distributed ML system spanning multiple data centers. Unfortunately, communicating over WANs can significantly degrade ML system performance (by as much as 53.7× in our study) because the communication overwhelms the limited WAN bandwidth. Our goal in this work is to develop a geo-distributed ML system that (1) employs an intelligent communication mechanism over WANs to efficiently utilize the scarce WAN bandwidth, while retaining the accuracy and correctness guarantees of an ML algorithm; and (2) is generic and flexible enough to run a wide range of ML algorithms, without requiring any changes to the algorithms. To this end, we introduce a new, general geo-distributed ML system, Gaia, that decouples the communication within a data center from the communication between data centers, enabling different communication and consistency models for each. We present a new ML synchronization model, Approximate Synchronous Parallel (ASP), whose key idea is to dynamically eliminate insignificant communication between data centers while still guaranteeing the correctness of ML algorithms. Our experiments on our prototypes of Gaia running across 11 Amazon EC2 global regions and on a cluster that emulates EC2 WAN bandwidth show that Gaia provides 1.8-53.5× speedup over two state-of-the-art distributed ML systems, and is within 0.94-1.40× of the speed of running the same ML algorithm on machines on a local area network (LAN).
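To make the ASP idea concrete, below is a minimal Python sketch of the significance-filter mechanism the abstract alludes to: each data center applies parameter updates locally at LAN speed, accumulates them, and only ships an accumulated update across the WAN once its magnitude becomes significant relative to the current parameter value. The class name, the 1% default threshold, and the NumPy-based interface are illustrative assumptions for exposition, not Gaia's actual implementation.

```python
import numpy as np

class SignificanceFilter:
    """Sketch of ASP's cross-data-center significance filter.

    Accumulates local parameter updates inside a data center and only
    propagates an update over the WAN once it is 'significant' relative
    to the current parameter value. Hypothetical API, not Gaia's code.
    """

    def __init__(self, params: np.ndarray, threshold: float = 0.01):
        self.params = params                  # local copy of the model
        self.pending = np.zeros_like(params)  # updates not yet sent over WAN
        self.threshold = threshold            # relative-significance cutoff (assumed 1%)

    def apply_local_update(self, delta: np.ndarray) -> np.ndarray:
        """Apply an update locally (fast LAN path); return the entries of the
        accumulated update that must cross the WAN (zeros elsewhere)."""
        self.params += delta
        self.pending += delta
        # An accumulated update is significant if its magnitude exceeds
        # `threshold` times the magnitude of the current parameter value.
        significant = np.abs(self.pending) > self.threshold * np.abs(self.params)
        wan_update = np.where(significant, self.pending, 0.0)
        self.pending[significant] = 0.0       # sent: reset those accumulators
        return wan_update


if __name__ == "__main__":
    f = SignificanceFilter(params=np.ones(4), threshold=0.01)
    print(f.apply_local_update(np.array([0.5, 1e-5, 1e-5, 0.2])))
    # Large updates cross the WAN immediately; tiny ones accumulate locally.
```

Per the abstract, this kind of filtering applies only to communication between data centers; workers within a data center synchronize over the LAN using a conventional consistency model, which is what lets Gaia approach LAN-level speed.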