Publication | Open Access
Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters
Citations: 412
References: 4
Year: 2006
Application-level checkpointing and fault-tolerant algorithms are more time- and space-efficient than system-level checkpoints, which cannot use application-specific knowledge. The article presents the motivation, design, and implementation of BLCR, a system-level checkpoint/restart tool for Linux clusters that targets typical HPC applications, including MPI. Because system-level checkpointing allows preemption, BLCR can respond to fault precursors such as elevated ECC memory error rates or sensor temperatures. Preemption can also improve batch scheduling: it reduces idle cycles by allowing shutdown without a queue-draining period, and it reduces average queued time by confining large jobs to off-peak hours. These capabilities make BLCR a valuable tool for efficient resource management in Linux clusters.
This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time- and space-efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling, for instance by reducing idle cycles (by allowing for shutdown without any queue-draining period, or reallocation of resources to eliminate idle nodes when better-fitting jobs are queued) and by reducing the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.