Publication | Closed Access

Approaches for Resilience against Cascading Failures in Cloud Datacenters

Citations: 21

References: 42

Year: 2018

Abstract

In a modern cloud datacenter, a cascading failure causes many Service Level Objective (SLO) violations. A cascading failure begins when a set of physical machines (PMs) in one failure domain fails and their workloads are transferred to PMs in another failure domain so that they can continue running. Because cloud resources are oversubscribed, the domain receiving the additional workloads may itself become overloaded and fail, forcing a further transfer of workloads to other domains. As this process repeats, a cascading failure develops. Few previous methods handle cascading failures effectively. To address this problem, we propose a Cascading Failure Resilience System (CFRS), which incorporates three methods: Overload-Avoidance VM Reassignment (OAVR), VM backup set placement (VMset), and Dynamic Oversubscription Ratio Adjustment (DOA). Trace-driven simulation experiments show that CFRS outperforms the comparison methods in the number of domain failures, the number of failed PMs, and the number of SLO violations.
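
The cascade mechanism described in the abstract can be illustrated with a toy model. The sketch below is not CFRS or any of its three methods; it only shows, under simplifying assumptions, how a naive transfer policy lets one domain failure propagate. The function name `simulate_cascade`, the per-domain `capacity` and `load` values, and the "move everything to the least-utilized surviving domain" policy are all hypothetical choices made for illustration.

```python
def simulate_cascade(capacity, load, first_failure):
    """Return the order in which failure domains fail in a toy model.

    capacity[i]: load domain i can carry before it overloads (assumed units)
    load[i]:     load currently placed on domain i
    first_failure: index of the domain that fails initially
    """
    load = list(load)                      # work on a copy
    alive = set(range(len(capacity)))
    failed = []
    pending = [first_failure]

    while pending:
        d = pending.pop(0)
        if d not in alive:
            continue
        alive.discard(d)
        failed.append(d)
        if not alive:
            break                          # nowhere left to move the workload

        # Naive policy (an assumption): move the whole workload of the failed
        # domain to the surviving domain with the lowest utilization.
        target = min(sorted(alive), key=lambda i: load[i] / capacity[i])
        load[target] += load[d]
        load[d] = 0

        # An overloaded receiver fails in turn, which continues the cascade.
        if load[target] > capacity[target]:
            pending.append(target)

    return failed


# Heavily loaded domains: one failure takes down every domain.
print(simulate_cascade([100, 100, 100], [80, 80, 80], first_failure=0))
# Lightly loaded domains: the receiving domain absorbs the workload.
print(simulate_cascade([100, 100, 100], [40, 40, 40], first_failure=0))
```

In this model the only difference between the two runs is how much headroom each domain has, which loosely mirrors the role the abstract assigns to oversubscription: the less spare capacity a receiving domain has, the more likely a transfer is to overload it.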
