Publication | Closed Access
Shasta Log Aggregation, Monitoring and Alerting in HPC Environments with Grafana Loki and ServiceNow
Citations: 14
References: 0
Year: 2022
Unknown Venue
Keywords: Cluster Computing; Engineering; Automated Log Aggregation; Service Monitoring; Data Science; Complex Event Processing; ServiceNow Platforms; Systems Engineering; Data Integration; Log Management; Data Management; HPC Environments; Computer Science; Shasta Log Aggregation; Log Analysis; Comprehensive Event Management; Performance Monitoring; Cloud Computing; Grafana Loki; Monitoring; System Monitoring; Industrial Informatics; Network Monitoring; Big Data; Event-driven Monitoring
Ongoing deployments of post-petascale computing systems have led to a proliferation of hybrid computing models and the orchestration of many complex services, which in turn has increased operational challenges. It is increasingly important to deploy integrated, comprehensive event management and monitoring solutions that collect health logs and metrics from the computing infrastructure, then correlate and analyze the gathered data to reduce response time and downtime when a computational center faces critical issues caused by physical and digital threats. To address these challenges, this paper presents an automated log aggregation, monitoring and alerting framework that leverages the Operations Monitoring and Networking Infrastructure (OMNI) together with the Hewlett Packard Enterprise (HPE) Shasta, Grafana Loki and ServiceNow platforms to enable comprehensive, proactive event response management and monitoring. We also present two case studies of automated remediation workflows that employ the proposed framework at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL). The case studies use the Perlmutter computational system for real-time collection, aggregation, correlation, analysis, management and visualization of system health metrics and logs in a single pane of glass, with the aim of enhancing proactive monitoring and operational efficiency.
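As a rough illustration of the aggregate-then-alert flow the abstract describes, the following minimal Python sketch builds a request body in the JSON shape accepted by Grafana Loki's HTTP push endpoint (`/loki/api/v1/push`) and derives a simple alert record from the same log lines. It is not the paper's implementation. The function names, the label names (`host`, `job`), the matching rule, the threshold, and the alert record fields are all hypothetical; in particular, the alert record does not claim to follow the ServiceNow event schema, and no network calls are made.

```python
import json
import time


def loki_push_payload(labels, lines, ts_ns=None):
    """Build a JSON-serializable body for Loki's push endpoint.

    Each log entry is a pair [timestamp in nanoseconds as a string, line].
    """
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Offset each line by 1 ns so entries in one batch stay ordered.
    values = [[str(ts_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


def alert_event(labels, lines, pattern="error", threshold=2):
    """Return a hypothetical alert record, or None below the threshold.

    The field names are illustrative assumptions, not a vendor schema.
    """
    hits = [line for line in lines if pattern in line.lower()]
    if len(hits) < threshold:
        return None
    return {
        "source": "loki",
        "node": labels.get("host", "unknown"),
        "severity": "2",
        "description": f"{len(hits)} lines matched '{pattern}'",
    }


if __name__ == "__main__":
    labels = {"host": "nid0001", "job": "syslog"}  # hypothetical labels
    lines = ["link up", "Error: fan speed low", "error: fan stopped"]
    print(json.dumps(loki_push_payload(labels, lines), indent=2))
    print(alert_event(labels, lines))
```

In a real deployment the payload would be sent to a Loki instance with an HTTP POST, and the alerting rule would more likely live in Loki's ruler or Grafana alerting than in client code; the sketch only shows the data shapes involved.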