Publication | Closed Access
Fast failure recovery in distributed graph processing systems
Citations: 40
References: 21
Year: 2014
Cluster Computing, Engineering, Network Analysis, Fault-tolerant Messaging, Giraph System, Recovery Process, Systems Engineering, Distributed Graph, Fault Recovery, Parallel Computing, Fast Failure Recovery, Computer Engineering, Computer Science, Graph Algorithm, Distributed Processing, Fault-tolerant Network, Network Science, Graph Theory, Edge Computing, Cloud Computing, Parallel Programming
Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.
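The key idea in the abstract, spreading the lost part of the graph over several surviving nodes while accounting for both computation and communication cost, can be illustrated with a small sketch. This is not the paper's algorithm or Giraph code. It is a simple greedy heuristic written under stated assumptions: the function name `reassign_lost_vertices`, the cost model (one unit of computation per edge, one unit of communication per neighbor on a different node), and the `comm_weight` parameter are all hypothetical and chosen only for illustration.

```python
def reassign_lost_vertices(lost_vertices, edges, owner, survivors, comm_weight=1.0):
    """Greedily assign each lost vertex to one of the surviving nodes.

    lost_vertices: vertex ids that lived on the failed node(s)
    edges:         dict mapping vertex id -> list of neighbor vertex ids
    owner:         dict mapping surviving vertex id -> node id that holds it
    survivors:     node ids chosen to take part in the recovery
    comm_weight:   hypothetical knob trading communication against computation
    """
    load = {node: 0.0 for node in survivors}  # recovery work assigned so far
    placement = {}

    # Place high-degree vertices first; ties are broken by vertex id so the
    # result is deterministic.
    ordered = sorted(lost_vertices, key=lambda v: (-len(edges.get(v, ())), v))

    for v in ordered:
        neighbors = edges.get(v, ())
        best_node, best_cost = None, None
        for node in survivors:
            # Assumed computation cost: existing load plus one unit per edge.
            compute = load[node] + len(neighbors)
            # Assumed communication cost: neighbors that are not on this node.
            # Lost neighbors that have not been placed yet count as remote.
            remote = sum(
                1 for u in neighbors
                if placement.get(u, owner.get(u)) != node
            )
            cost = compute + comm_weight * remote
            if best_cost is None or cost < best_cost:
                best_node, best_cost = node, cost
        placement[v] = best_node
        load[best_node] += len(neighbors)

    return placement
```

Called with the lost vertices, the adjacency lists, the current vertex-to-node map, and the chosen survivors, the function returns a dictionary from each lost vertex to the node that would recover it. A real system would also have to replay checkpointed state and logged messages, which this sketch leaves out.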