Publication | Closed Access
Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models
Citations: 83
References: 50
Year: 2023
Unknown Venue
Artificial Intelligence, Engineering, Semantic Web, Large Language Model, Corpus Linguistics, Text Mining, Natural Language Processing, Large Language Models, Data Science, Data Mining, Computational Linguistics, Language Engineering, Incident Management, Machine Translation, Large AI Model, Mitigation Steps, NLP Task, Knowledge Discovery, Computer Science, Cloud Services, Semantic Parsing, Cloud Incidents
Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require a significant amount of domain knowledge and manual effort to root-cause and mitigate production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we conduct the first large-scale study to evaluate the effectiveness of these models for helping engineers root-cause and mitigate production incidents. We perform a rigorous study at Microsoft on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned, and multi-task settings using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.