An Analysis of the Automatic Bug Fixing Performance of ChatGPT

TLDR

Automated program repair methods, ranging from synthesis to search-based edits, have recently incorporated deep learning. ChatGPT is not primarily designed for program repair, and its bug-fixing performance has so far been unclear. The study evaluates ChatGPT on the QuixBugs benchmark and compares its results to those reported for other approaches in the literature. ChatGPT's bug-fixing performance is competitive with the deep learning approaches CoCoNut and Codex and notably better than the results reported for standard program repair approaches. When hints are provided through the dialogue, its success rate increases to 31 of 40 bugs, outperforming the state of the art.

Abstract

To support software developers in finding and fixing software bugs, several automated program repair techniques have been introduced. Given a test suite, standard methods usually either synthesize a repair or navigate a search space of software edits to find variants that pass the test suite. Recent program repair methods are based on deep learning approaches. One of these novel methods, which is not primarily intended for automated program repair but is still suitable for it, is ChatGPT. The bug fixing performance of ChatGPT, however, is so far unclear. Therefore, in this paper we evaluate ChatGPT on the standard bug fixing benchmark set, QuixBugs, and compare its performance with the results of several other approaches reported in the literature. We find that ChatGPT's bug fixing performance is competitive with the common deep learning approaches CoCoNut and Codex and notably better than the results reported for the standard program repair approaches. In contrast to previous approaches, ChatGPT offers a dialogue system through which further information, e.g., the expected output for a certain input or an observed error message, can be entered. By providing such hints to ChatGPT, its success rate can be further increased, fixing 31 out of 40 bugs and outperforming the state of the art.
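
The two ideas in the abstract, validating a candidate repair against a test suite and feeding a failing case back as a hint, can be illustrated with a minimal Python sketch. Everything here is hypothetical: the toy `buggy_gcd` function, the test cases, the helper names, and the wording of the hint are assumptions for illustration only, not the paper's actual prompts, tooling, or a QuixBugs program.

```python
from typing import Callable, List, Optional, Tuple

# A test case pairs input arguments with the expected output.
TestCase = Tuple[Tuple[int, int], int]

# Hypothetical test suite for a toy gcd function (not taken from QuixBugs).
TESTS: List[TestCase] = [((12, 8), 4), ((7, 3), 1), ((10, 0), 10)]


def buggy_gcd(a: int, b: int) -> int:
    """Toy buggy program: returns the wrong variable after the loop."""
    while b:
        a, b = b, a % b
    return b  # bug: should return a


def candidate_gcd(a: int, b: int) -> int:
    """A candidate repair, such as one a repair tool or model might propose."""
    while b:
        a, b = b, a % b
    return a


def first_failure(func: Callable[[int, int], int],
                  tests: List[TestCase]) -> Optional[TestCase]:
    """Return the first failing test case, or None if the whole suite passes."""
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return (args, expected)
        except Exception:
            # A crash on a test input also counts as a failure.
            return (args, expected)
    return None


def hint_message(failure: TestCase) -> str:
    """Build a follow-up hint from a failing case (assumed wording)."""
    args, expected = failure
    return (f"The function does not work. For the input {args} "
            f"it should return {expected}.")


failure = first_failure(buggy_gcd, TESTS)
if failure is not None:
    print(hint_message(failure))
print("candidate passes:", first_failure(candidate_gcd, TESTS) is None)
```

In this sketch a candidate counts as a fix only if it passes every test; a failing case is turned into the kind of expected-output hint that the abstract describes entering into the dialogue.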
