Publication | Closed Access
Automated Intelligent Healing in Cloud-Scale Data Centers
12
Citations
19
References
2021
Year
Unknown Venue
Modern cloud-scale data centers necessitate self-healing (i.e., the automation of detecting and repairing component failures) to support reliable and scalable cloud services in the face of prevalent failures. Traditional policy-based self-healing solutions rely on expert knowledge to define the proper policies for choosing repair actions, and hence are error-prone and non-scalable in practical deployment. We propose AIHS, an automated intelligent healing system that applies machine learning to achieve scalable self-healing in cloud-scale data centers. AIHS is designed as a full-fledged, general pipeline that supports various machine learning models for predicting accurate repair actions based on raw monitoring logs. We conduct extensive trace-driven and production experiments, and show that AIHS achieves higher prediction accuracy than current self-healing solutions and successfully fixes 92.4% of the total of 33.7 million production failures over seven months. AIHS also reduces 51% of unavailable time of each failed server on average compared to policy-based self-healing. AIHS is now deployed in production cloud-scale data centers at Alibaba with a total of 600 K servers. We open-source a Python prototype that reproduces the self-healing pipeline of AIHS for public validation.
| Year | Citations | |
|---|---|---|
Page 1
Page 1