Publication | Closed Access
Troubleshooting thousands of jobs on production grids using data mining techniques
Year: 2008
Citations: 20
References: 21
Unknown Venue
Software Maintenance, Cluster Computing, Engineering, Business Intelligence, Pattern Discovery, Fault Forecasting, Software Engineering, Pattern Mining, Business Analytics, Software Analysis, Optimization-based Data Mining, Reliability Engineering, Data Science, Data Mining, Data Mining Techniques, Systems Engineering, New Challenges, Parallel Computing, Knowledge Discovery Process, Process Mining, Quantitative Management, Failure Detection, High-performance Data Analytics, Predictive Analytics, Knowledge Discovery, Computer Science, Data-intensive Computing, Program Analysis, Software Testing, Large Scale Production, Data Stream Mining, Cpu Production Grid, Production Grids, Business, Parallel Programming, Industrial Informatics, Big Data
Large-scale production computing grids introduce new challenges in debugging and troubleshooting. A user who submits a workload consisting of tens of thousands of jobs to a grid of thousands of processors has a good chance of receiving thousands of error messages as a result. How can one begin to reason about such problems? We propose that data mining techniques can be employed to classify failures according to the properties of the jobs and machines involved. We demonstrate this technique through several case studies on real workloads consisting of tens of thousands of jobs. We apply the same techniques to a year's worth of data on a 3000-CPU production grid and use them to gain a high-level understanding of the system behavior.
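The idea of classifying failures by job and machine properties can be illustrated with a small sketch. The abstract does not say which data mining algorithm the authors use, so the information-gain ranking below is an assumed stand-in, and the record fields (`os`, `arch`, `failed`) and the toy data are hypothetical. It ranks attributes by how well they separate failed jobs from successful ones.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(records, attribute, label="failed"):
    """Reduction in label entropy obtained by splitting records on one attribute."""
    base = entropy([r[label] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[r[attribute]].append(r[label])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


def rank_attributes(records, attributes, label="failed"):
    """Attributes sorted from most to least informative about failure."""
    return sorted(
        attributes,
        key=lambda a: information_gain(records, a, label),
        reverse=True,
    )


# Hypothetical toy workload: every job on a "solaris" machine fails.
jobs = [
    {"os": "linux", "arch": "x86_64", "failed": False},
    {"os": "linux", "arch": "i686", "failed": False},
    {"os": "solaris", "arch": "x86_64", "failed": True},
    {"os": "solaris", "arch": "i686", "failed": True},
]

ranking = rank_attributes(jobs, ["os", "arch"])
print(ranking)
```

In this toy data the operating system fully explains the failures while the architecture explains none of them, so `os` ranks first. On a real workload the same ranking would only suggest which properties deserve a closer look.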