Decoupling computation and data scheduling in distributed data-intensive applications

TLDR

High-energy physics, bioinformatics, and similar fields run many loosely coupled jobs that create and consume large data sets. Data Grids aim to harness geographically distributed resources for such problems, but effective scheduling is difficult because it must satisfy multiple metrics and constraints across independent job sources and many storage, compute, and network resources. The authors propose a scheduling framework in which data movement is either tightly coupled to job scheduling or handled asynchronously by a decoupled process driven by observed access patterns and load. They develop a family of algorithms and evaluate combinations of them in simulation. The results suggest that the impact of replication must be considered, but that data movement need not always be coupled with computation scheduling. The two activities can be addressed separately, which simplifies design and implementation.

Abstract

In high-energy physics, bioinformatics, and other disciplines, we encounter applications involving numerous, loosely coupled jobs that both access and generate large data sets. So-called Data Grids seek to harness geographically distributed resources for such large-scale data-intensive problems. Yet effective scheduling in such environments is challenging, due to a need to address a variety of metrics and constraints while dealing with multiple, potentially independent sources of jobs and a large number of storage, compute, and network resources. We describe a scheduling framework that addresses these problems. Within this framework, data movement operations may be either tightly bound to job scheduling decisions or, alternatively, performed by a decoupled, asynchronous process on the basis of observed data access patterns and load. We develop a family of algorithms and use simulation studies to evaluate various combinations. Our results suggest that while it is necessary to consider the impact of replication, it is not always necessary to couple data movement and computation scheduling. Instead, these two activities can be addressed separately, thus significantly simplifying the design and implementation.
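
The decoupled arrangement described above can be illustrated with a minimal Python sketch. It does not reproduce the authors' algorithms, which the abstract does not specify. The names `Site`, `schedule_job`, and `replicate_popular`, the least-loaded placement rule, and the access-count threshold are all assumptions chosen for illustration. The point is only that the job scheduler never triggers a transfer, while a separate data scheduler replicates data sets based on observed accesses.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical grid site; all fields are illustrative assumptions."""
    name: str
    datasets: set = field(default_factory=set)                # data sets stored locally
    queue: list = field(default_factory=list)                 # jobs waiting at this site
    access_counts: Counter = field(default_factory=Counter)   # observed accesses per data set


def schedule_job(job_id, dataset, sites):
    """Job scheduler: pick the least-loaded site that already holds the data.

    It records the access but never moves data itself.
    """
    holders = [s for s in sites if dataset in s.datasets]
    if not holders:
        raise LookupError(f"no site holds {dataset!r}")
    target = min(holders, key=lambda s: len(s.queue))
    target.queue.append(job_id)
    target.access_counts[dataset] += 1
    return target


def replicate_popular(sites, threshold):
    """Data scheduler: runs separately, driven only by observed access counts.

    Any data set accessed at least `threshold` times at a site is copied to
    the least-loaded site that lacks it, and its counter is then reset.
    """
    moves = []
    for site in sites:
        for dataset, count in list(site.access_counts.items()):
            if count < threshold:
                continue
            candidates = [s for s in sites if dataset not in s.datasets]
            if candidates:
                dest = min(candidates, key=lambda s: len(s.queue))
                dest.datasets.add(dataset)
                moves.append((dataset, site.name, dest.name))
            site.access_counts[dataset] = 0
    return moves
```

In this sketch, calling `schedule_job` repeatedly for one data set loads the single site that holds it. A later, independent call to `replicate_popular` creates a second replica, after which new jobs for that data set can go to the less-loaded copy. A tightly coupled design would instead decide on the transfer inside `schedule_job`.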
