numpywren: serverless linear algebra

TLDR

Linear algebra operations are essential in scientific computing and machine learning, but scaling them beyond a single machine is difficult: traditional approaches require either access to supercomputing clusters or complex cluster configuration and management. The authors show that serverless environments, which disaggregate storage and compute resources, can provide elastic scalability and simpler management for linear algebra workloads. They present numpywren, a linear algebra system built on a serverless architecture, together with LAmbdaPACK, a domain-specific language for implementing highly parallel linear algebra algorithms in a serverless setting. In experiments, numpywren's completion time is within 33% of ScaLAPACK for matrix multiply, singular value decomposition, and Cholesky decomposition, and its compute efficiency (total CPU-hours) is up to 240% better. However, serverless runtimes cannot exploit locality across the cores in a machine, which limits their network efficiency and, in turn, performance on algorithms such as QR factorization. The authors suggest that cloud providers could better support such workloads through small changes in their infrastructure.

Abstract

Linear algebra operations are widely used in scientific computing and machine learning applications. However, it is challenging for scientists and data analysts to run linear algebra at scales beyond a single machine. Traditional approaches either require access to supercomputing clusters or impose configuration and cluster management challenges. In this paper, we show how the disaggregation of storage and compute resources in so-called "serverless" environments, combined with compute-intensive workload characteristics, can be exploited to achieve elastic scalability and ease of management. We present numpywren, a system for linear algebra built on a serverless architecture. We also introduce LAmbdaPACK, a domain-specific language designed to implement highly parallel linear algebra algorithms in a serverless setting. We show that, for certain linear algebra algorithms such as matrix multiply, singular value decomposition, and Cholesky decomposition, numpywren's performance (completion time) is within 33% of ScaLAPACK, and its compute efficiency (total CPU-hours) is up to 240% better due to elasticity, while providing an easier-to-use interface and better fault tolerance. At the same time, we show that the inability of serverless runtimes to exploit locality across the cores in a machine fundamentally limits their network efficiency, which limits performance on other algorithms such as QR factorization. This highlights how cloud providers could better support these types of computations through small changes in their infrastructure.
