Publication | Closed Access
Critical event prediction for proactive management in large-scale computer clusters
Citations: 265
References: 19
Year: 2003
Unknown Venue
Cluster Computing, Engineering, Fault Forecasting, Cluster Technology, Reliability Engineering, Continuous Monitoring, Data Science, Data Mining, Complex Event Processing, Systems Engineering, Event Logs, Autonomic Computing, Predictive Analytics, Knowledge Discovery, Computer Science, System Management, Urgent Computing, Critical Event Prediction, Automation, System Monitoring, Industrial Informatics, Failure Prediction, Big Data, Event-driven Monitoring
As the complexity of distributed computing systems increases, systems management tasks require significantly higher levels of automation; examples include diagnosis and prediction based on real-time streams of computer events, setting alarms, and performing continuous monitoring. The core of autonomic computing, a recently proposed initiative towards next-generation IT systems capable of 'self-healing', is the ability to analyze data in real time and to predict potential problems. The goal is to avoid catastrophic failures through prompt execution of remedial actions.

This paper describes an attempt to build a proactive prediction and control system for large clusters. We collected event logs containing various system reliability, availability and serviceability (RAS) events, and system activity reports (SARs) from a 350-node cluster system for a period of one year. The 'raw' system health measurements contain a great deal of redundant event data, which is either repetitive in nature or misaligned with respect to time. We applied a filtering technique and modeled the data into a set of primary and derived variables. We used these variables in probabilistic networks to establish event correlations through prediction algorithms. We also evaluated the role of time-series methods, rule-based classification algorithms and Bayesian network models in event prediction.

Based on historical data, our results suggest that it is feasible to predict system performance parameters (SARs) with a high degree of accuracy using time-series models. Rule-based classification techniques can be used to extract machine-event signatures to predict critical events with up to 70% accuracy.
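
The abstract does not say which filtering technique was applied to the redundant event data. As a rough illustration only, the sketch below shows one common approach: sort events by time, then drop an event if the same node reported the same message within a fixed window. The `Event` type, the `filter_repeats` function, and the 300-second window are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical log record; field names are assumptions, not the paper's schema."""
    timestamp: float  # seconds since some reference point
    node: str         # identifier of the reporting cluster node
    message: str      # event text or event code


def filter_repeats(events, window=300.0):
    """Keep an event only if the same (node, message) pair was not
    already kept within the preceding `window` seconds.

    Events are sorted by timestamp first, so input that arrives out of
    order is still handled consistently.
    """
    last_kept = {}  # (node, message) -> timestamp of the last kept event
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.node, ev.message)
        prev = last_kept.get(key)
        if prev is None or ev.timestamp - prev > window:
            kept.append(ev)
            last_kept[key] = ev.timestamp
    return kept


if __name__ == "__main__":
    raw = [
        Event(0.0, "n1", "disk error"),
        Event(10.0, "n1", "disk error"),   # repeat within the window
        Event(5.0, "n2", "disk error"),    # different node
        Event(400.0, "n1", "disk error"),  # outside the window
    ]
    for ev in filter_repeats(raw):
        print(ev.timestamp, ev.node, ev.message)
```

Comparing against the last kept event, rather than the last seen one, means a long burst of repeats still yields one record per window instead of being suppressed entirely. Whether that matches the paper's filter is an open assumption.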