Experiments with clustering as a software remodularization method

TLDR

Reverse engineering grows more important as valuable software ages, and clustering is one of its key activities, yet the well-documented general-purpose clustering methods may not fully suit the reverse engineering context. The study investigates clustering algorithms and parameters for software remodularization. The authors examined three aspects of clustering (entity descriptions, coupling metrics, and algorithms) in experiments on gcc, Linux, Mosaic, and a 2-million-LOC legacy system. The experiments confirm that a proper entity description scheme is important, identify a few good coupling metrics, and characterize the quality of different clustering algorithms. The authors also propose novel description schemes not directly based on source code and advocate better formal evaluation of clustering results.

Abstract

As valuable software systems age, reverse engineering becomes increasingly important to the companies that must maintain the code. Clustering is a key activity in reverse engineering, used to discover a better design for a system or to extract significant concepts from its code. It is a long-established and highly sophisticated field that offers many methods for different needs. Although these methods are well documented, the existing discussions may not apply entirely to the reverse engineering domain. We study several clustering algorithms and other parameters to establish whether and why they could be used for software remodularization. We examine three aspects of the clustering activity: the abstract descriptions chosen for the entities to cluster, the metrics that compute coupling between entities, and the clustering algorithms themselves. The experiments were conducted on three public domain systems (gcc, Linux and Mosaic) and a real-world legacy system (2 million LOC). Among other things, we confirm the importance of a proper description scheme for the entities being clustered, list a few good coupling metrics, and characterize the quality of different clustering algorithms. We also propose novel description schemes not directly based on the source code, and we advocate better formal evaluation methods for clustering results.
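
The three aspects named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the file names and feature sets are invented, the Jaccard coefficient stands in for "a coupling metric", and single-linkage agglomerative clustering with an arbitrary threshold stands in for "a clustering algorithm". The abstract does not say which metrics or algorithms the study recommends.

```python
from itertools import combinations

# Aspect 1, entity descriptions: each file is described by a set of features
# (here, the routines it references). All names are hypothetical.
entities = {
    "parser.c": {"next_token", "emit_error", "alloc_node"},
    "lexer.c": {"next_token", "emit_error", "read_char"},
    "net.c": {"open_socket", "send_packet"},
    "http.c": {"open_socket", "send_packet", "parse_header"},
}

def jaccard(a, b):
    """Aspect 2, coupling metric: shared features over all features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_similarity(c1, c2):
    """Single linkage: similarity of the closest pair across two clusters."""
    return max(jaccard(entities[x], entities[y]) for x in c1 for y in c2)

def agglomerate(names, threshold):
    """Aspect 3, algorithm: merge the most similar clusters until no pair
    reaches the (assumed) similarity threshold."""
    clusters = [frozenset([n]) for n in names]
    while len(clusters) > 1:
        sim, a, b = max(
            ((cluster_similarity(a, b), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if sim < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

modules = agglomerate(sorted(entities), threshold=0.3)
for module in modules:
    print(sorted(module))
```

With these invented inputs the sketch groups the two networking files together and the two front-end files together. Changing the feature sets, the metric, or the linkage rule changes the result, which is the kind of sensitivity the experiments set out to measure.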
