Recovering traceability links between code and documentation

TLDR

Software system documentation is typically informal, natural-language free text: requirement specifications, design documents, manual pages, development journals, error logs, and maintenance reports. Programmers tend to choose meaningful identifiers whose mnemonics capture application-domain knowledge, so analyzing those identifiers can link high-level concepts to program elements. We propose an information-retrieval (IR) method to recover traceability links between source code and free-text documents. We apply a probabilistic and a vector-space IR model in two case studies, tracing C++ source code to manual pages and Java code to functional requirements. We compare the two models, discuss their benefits and limitations, and outline directions for improvement.

Abstract

Software system documentation is almost always expressed informally in natural language and free text. Examples include requirement specifications, design documents, manual pages, system development journals, error logs, and related maintenance reports. We propose a method based on information retrieval to recover traceability links between source code and free-text documents. A premise of our work is that programmers use meaningful names for program items, such as functions, variables, types, classes, and methods. We believe that the application-domain knowledge that programmers process when writing the code is often captured by the mnemonics for identifiers; therefore, the analysis of these mnemonics can help to associate high-level concepts with program concepts and vice versa. We apply both a probabilistic and a vector space information retrieval model in two case studies to trace C++ source code to manual pages and Java code to functional requirements. We compare the results of applying the two models, discuss their benefits and limitations, and describe directions for improvement.
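
To make the idea concrete, the following is a minimal sketch of how a vector space model could rank free-text documents against a piece of source code by the words hidden in its identifiers. It is not the authors' implementation: the abstract does not specify the weighting scheme or preprocessing, so the tf-idf variant, the cosine similarity measure, the absence of stop-word removal and stemming, and all names (`tokenize`, `idf_weights`, `rank_documents`, the sample requirements, and the sample class) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text or code into lowercase words, breaking identifiers apart.

    'removeUserAccount' -> remove, user, account; 'max_retry' -> max, retry.
    """
    words = []
    for part in re.split(r"[^A-Za-z]+", text):
        # Split camelCase / PascalCase runs; keep acronyms such as HTTP whole.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]


def idf_weights(doc_tokens):
    """Inverse document frequency per term (assumed variant: log(N/df) + 1)."""
    n = len(doc_tokens)
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def vectorize(tokens, idf):
    """Sparse tf-idf vector; terms unknown to the document collection are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def rank_documents(code_text, documents):
    """Return (document name, score) pairs, most similar to the code first."""
    names = list(documents)
    doc_tokens = [tokenize(documents[name]) for name in names]
    idf = idf_weights(doc_tokens)
    query = vectorize(tokenize(code_text), idf)
    scored = [
        (name, cosine(query, vectorize(tokens, idf)))
        for name, tokens in zip(names, doc_tokens)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical requirements and code fragment, invented for this sketch.
documents = {
    "req_login": "The user shall log in with an account name and a password.",
    "req_report": "The system shall print a monthly sales report.",
}
code = "class AccountManager { boolean checkPassword(String accountName, String password) }"

ranking = rank_documents(code, documents)
print(ranking)
```

In this toy input the identifiers `AccountManager`, `accountName`, and `checkPassword` share the words "account", "name", and "password" with the login requirement and nothing with the reporting requirement, so the login requirement is ranked first as the candidate traceability link.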
