Publication | Closed Access
CP-Miner: finding copy-paste and related bugs in large-scale software code
Citations: 600
References: 41
Year: 2006
Software Maintenance, Engineering, Software Engineering, Source Code Analysis, Software Analysis, Formal Verification, Empirical Software Engineering Research, Data Science, Data Mining, Operating System Code, Software Mining, Computer Science, Debugger, Related Bugs, Automated Repair, Static Program Analysis, Software Design, Replicated Code, Program Analysis, Software Testing, Formal Methods, Parallel Programming, Large Software Suites, System Software
Large software suites contain substantial replicated code, some of it from copy-and-paste, which contributes to bugs. Existing static analyzers either do not scale to such suites or are not robust when copied code is modified with insertions and deletions, and they do not detect copy-paste-related bugs. This study introduces CP-Miner, a data-mining tool designed to efficiently locate copy-pasted code and uncover copy-paste bugs in large codebases, while also exploring characteristics of such duplication. CP-Miner analyzes copy-pasted segments by size, granularity, and modification patterns, and examines their distribution across modules and software versions. In under 20 minutes, CP-Miner identified 190,000 copy-pasted segments in Linux and 150,000 in FreeBSD. It also uncovered 49 new bugs in Linux and 31 in FreeBSD, most of which have since been confirmed by the developers and fixed in later releases.
Recent studies have shown that large software suites contain significant amounts of replicated code. It is assumed that some of this replication is due to copy-and-paste activity and that a significant proportion of bugs in operating systems are due to copy-paste errors. Existing static code analyzers are either not scalable to large software suites or do not perform robustly where replicated code is modified with insertions and deletions. Furthermore, the existing tools do not detect copy-paste related bugs. In this paper, we propose a tool, CP-Miner, that uses data mining techniques to efficiently identify copy-pasted code in large software suites and detect copy-paste bugs. Specifically, it takes less than 20 minutes for CP-Miner to identify 190,000 copy-pasted segments in Linux and 150,000 in FreeBSD. Moreover, CP-Miner has detected many new bugs in popular operating systems, 49 in Linux and 31 in FreeBSD, most of which have since been confirmed by the corresponding developers and have been rectified in the following releases. In addition, we have found some interesting characteristics of copy-paste in operating system code. Specifically, we analyze the distribution of copy-pasted code by size (number of lines of code), granularity (basic blocks and functions), and modification within copy-pasted code. We also analyze copy-paste across different modules and various software versions.
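The abstract does not spell out the mining algorithm, so the following is only an illustrative sketch of the general idea of spotting copy-pasted code after identifiers have been renamed. It is not CP-Miner's method. The names `normalize`, `find_clone_candidates`, the `KEYWORDS` set, and the `window` parameter are all hypothetical, and unlike the tool described above, this sketch does not tolerate inserted or deleted statements.

```python
import re
from collections import defaultdict

# Assumption: a tiny, incomplete C keyword set, enough for illustration only.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(statement):
    """Map identifiers to ID and numeric literals to NUM so that
    renamed copies of a statement compare equal."""
    out = []
    for tok in TOKEN.findall(statement):
        if tok[0].isdigit():
            out.append("NUM")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            out.append("ID")
        else:
            out.append(tok)
    return " ".join(out)


def find_clone_candidates(statements, window=3):
    """Return lists of start indices at which the same run of `window`
    consecutive normalized statements occurs more than once."""
    normalized = [normalize(s) for s in statements]
    groups = defaultdict(list)
    for i in range(len(normalized) - window + 1):
        groups[tuple(normalized[i:i + window])].append(i)
    return [starts for starts in groups.values() if len(starts) > 1]


code = [
    "for (i = 0; i < n; i++)",
    "total += prices[i];",
    "avg = total / n;",
    "log(avg);",
    "for (j = 0; j < m; j++)",
    "sum += costs[j];",
    "mean = sum / m;",
]
print(find_clone_candidates(code, window=3))  # [[0, 4]]
```

Exact matching on fixed-size windows keeps the sketch short, but it is precisely the kind of approach the abstract calls fragile: a single inserted statement inside a copied segment would break the match.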