Publication | Closed Access
The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs
Citations: 319 | References: 66 | Year: 2015
Software Maintenance, Program Checking, Engineering, Verification, Software Engineering, Software Analysis, Formal Verification, Automated Software Engineering, Reliability Engineering, Data Science, Static Checking, Software Repair, Search-based Software Engineering, Reliability, Common Benchmark Problems, Computer Engineering, IntroClass Benchmarks, Computer Science, Debugger, Static Program Analysis, Automated Repair, Software Design, Program Analysis, Software Testing, Benchmark Programs, Formal Methods, C Programs, System Software, Automated Software Repair
Automated software repair lacks common benchmark problems, and existing benchmarks are not easily adapted to it because the task requires reproducible, important defects and a deterministic way to assess whether a defect has been repaired. This article introduces the ManyBugs and IntroClass benchmark datasets, which together comprise 1,183 defects in 15 C programs, and details the requirements they are meant to satisfy. It also reports baseline results for three existing repair methods, GenProg, AE, and TrpAutoRepair, to facilitate comparative evaluation. The datasets come with empirically defined guarantees of reproducibility and benchmark quality, and each defect is categorized to enable qualitative comparisons by bug or program type.
The field of automated software repair lacks a set of common benchmark problems. Although benchmark sets are used widely throughout computer science, existing benchmarks are not easily adapted to the problem of automatic defect repair, which has several special requirements. Most important of these is the need for benchmark programs with reproducible, important defects and a deterministic method for assessing if those defects have been repaired. This article details the need for a new set of benchmarks, outlines requirements, and then presents two datasets, ManyBugs and IntroClass, consisting between them of 1,183 defects in 15 C programs. Each dataset is designed to support the comparative evaluation of automatic repair algorithms asking a variety of experimental questions. The datasets have empirically defined guarantees of reproducibility and benchmark quality, and each study object is categorized to facilitate qualitative evaluation and comparisons by category of bug or program. The article presents baseline experimental results on both datasets for three existing repair methods, GenProg, AE, and TrpAutoRepair, to reduce the burden on researchers who adopt these datasets for their own comparative evaluations.