Modeling and tolerating heterogeneous failures in large parallel systems - Concepedia

Abstract

As supercomputers and clusters grow in size and complexity, system failures become inevitable. The hardware components of such systems (such as memory, disk, or network) can have different failure rates. Prior work assumes that failures affect an application equally, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because component failure dynamics are heterogeneous in space and time.
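To make the idea of a usage-aware failure model concrete, the following is a minimal sketch and not the model proposed in the paper. It assumes independent, exponentially distributed failures with constant per-unit rates, which deliberately ignores the heterogeneity in time that the abstract mentions. The rate values, the workload names, and the functions `app_failure_rate` and `survival_probability` are all hypothetical and chosen only for illustration.

```python
import math


def app_failure_rate(component_rates, usage):
    """Combined failure rate seen by one application.

    Assumption: failures are independent and exponential, so the rate of
    the first failure is the sum of the rates of every unit in use.
    """
    return sum(component_rates[name] * count for name, count in usage.items())


def survival_probability(rate, hours):
    """Probability that no failure occurs within `hours` at a constant rate."""
    return math.exp(-rate * hours)


# Hypothetical per-unit failure rates (failures per hour); not measured data.
rates = {"memory": 1e-5, "disk": 4e-5, "network": 2e-5}

# Hypothetical workloads: number of units of each component type in use.
memory_bound = {"memory": 100, "network": 10}
io_bound = {"disk": 100, "network": 10}

for label, usage in (("memory-bound", memory_bound), ("I/O-bound", io_bound)):
    rate = app_failure_rate(rates, usage)
    print(f"{label}: rate={rate:.2e}/h, MTBF={1 / rate:.0f} h, "
          f"P(survive 24 h)={survival_probability(rate, 24):.3f}")
```

Under these assumed numbers, the two workloads run on the same machine yet see different failure rates, because they exercise different components. A single system-wide failure rate would hide that difference.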
