Publication | Open Access
Efficiently localizing system anomalies for cloud infrastructures: a novel Dynamic Graph Transformer based Parallel Framework
51
Citations
33
References
2024
Year
Cluster ComputingAnomaly DetectionMachine LearningEngineeringBig Data AnalyticsCloud Computing ArchitectureNetwork AnalysisGraph ProcessingData ScienceData MiningDistributed CloudParallel ComputingOutlier DetectionComputer EngineeringComputer ScienceCloud InfrastructuresParallel FrameworkScalable ComputingCloud ComputingCloud EnvironmentGrid ComputingSystem AnomaliesGraph Neural Network
Abstract Cloud environment is a virtual, online, and distributed computing environment that provides users with large-scale services. And cloud monitoring plays an integral role in protecting infrastructures in the cloud environment. Cloud monitoring systems need to closely monitor various KPIs of cloud resources, to accurately detect anomalies. However, due to the complexity and highly dynamic nature of the cloud environment, anomaly detection for these KPIs with various patterns and data quality is a huge challenge, especially those massive unlabeled data. Besides, it’s also difficult to improve the accuracy of the existing anomaly detection methods. To solve these problems, we propose a novel Dynamic Graph Transformer based Parallel Framework (DGT-PF) for efficiently detect system anomalies in cloud infrastructures, which utilizes Transformer with anomaly attention mechanism and Graph Neural Network (GNN) to learn the spatio-temporal features of KPIs to improve the accuracy and timeliness of model anomaly detection. Specifically, we propose an effective dynamic relationship embedding strategy to dynamically learn spatio-temporal features and adaptively generate adjacency matrices, and soft cluster each GNN layer through Diffpooling module. In addition, we also use nonlinear neural network model and AR-MLP model in parallel to obtain better detection accuracy and improve detection performance. The experiment shows that the DGT-PF framework have achieved the highest F1-Score on 5 public datasets, with an average improvement of 21.6% compared to 11 anomaly detection models.
| Year | Citations | |
|---|---|---|
Page 1
Page 1