Publication | Closed Access
An empirical study on crash recovery bugs in large-scale distributed systems
Citations: 51
References: 43
Year: 2018
Unknown Venue
Software Maintenance, Cluster Computing, Engineering, Node Crashes, Computer Architecture, Software Engineering, Fault Tolerance, Fault-tolerant Messaging, Software Analysis, Reliability Engineering, Large-scale Distributed Systems, Systems Engineering, Fault Recovery, Crash Recovery Mechanisms, Parallel Computing, Data Management, Failure Detection, Reliability, Crash Recovery Bugs, Empirical Study, Distributed Systems, Computer Science, High Availability Software, Fault-tolerant Network, Program Analysis, Software Testing, Cloud Computing, System Software
In large-scale distributed systems, node crashes are inevitable and can happen at any time. Distributed systems are therefore usually designed to be resilient to node crashes through various crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoffs in Cassandra. However, faults in crash recovery mechanisms and their implementations can introduce intricate crash recovery bugs and lead to severe consequences.