Publication | Closed Access
Modeling and tolerating heterogeneous failures in large parallel systems
Citations: 87
References: 18
Year: 2011
Venue: Unknown
Topics: Cluster Computing, Engineering, Component Failure Dynamics, Computer Architecture, Software Engineering, Fault Tolerance, Reliability Engineering, Fault Analysis, Systems Engineering, Fault Recovery, Modeling And Simulation, Parallel Computing, Failure Detection, Computer Engineering, Different Failure Rates, Computer Science, Heterogeneous Failures, Different Hardware Components, High Availability Software, Parallel Programming, High Availability, System Software
As supercomputers and clusters grow in size and complexity, system failures become inevitable. The hardware components of such systems (such as memory, disk, or network) can fail at different rates. Prior work assumes that all failures affect an application equally, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because component failure dynamics are heterogeneous in space and time.
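To make the idea of a usage-aware failure model concrete, here is a minimal sketch. It is not the paper's model: it assumes independent, exponentially distributed component failures, and the rate values, the `usage` fractions, and the function names are all hypothetical. It shows only how two applications on the same machine could see different effective failure rates because they rely on different components.

```python
import math

# Hypothetical per-component failure rates (failures per hour).
# Illustrative values only; they are not taken from the paper.
COMPONENT_RATES = {"memory": 1e-4, "disk": 5e-5, "network": 2e-5}


def application_failure_rate(usage, rates=COMPONENT_RATES):
    """Effective failure rate seen by one application.

    `usage` maps a component name to the fraction of run time (0..1)
    during which a failure of that component would affect the
    application. Assumes independent exponential failures, so the
    weighted rates simply add.
    """
    for name, fraction in usage.items():
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"usage fraction for {name!r} must be in [0, 1]")
    return sum(rates[name] * fraction for name, fraction in usage.items())


def survival_probability(rate, hours):
    """Probability of running `hours` without a failure at constant `rate`."""
    return math.exp(-rate * hours)


# Two hypothetical applications with different component usage.
compute_bound = {"memory": 1.0, "disk": 0.0, "network": 0.1}
io_bound = {"memory": 1.0, "disk": 0.9, "network": 0.5}

rate_compute = application_failure_rate(compute_bound)
rate_io = application_failure_rate(io_bound)
```

Under these assumptions the I/O-bound application has the higher effective rate, so a single system-wide failure rate would overestimate the risk for one workload and underestimate it for the other. The sketch ignores the variation over time and across nodes that the abstract identifies as the hard part.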