Concepedia

Abstract

This paper reports on a comprehensive, fully automated resource use monitoring package, TACC Stats, which enables both consultants, users and other stakeholders in an HPC system to systematically and actively identify jobs/applications that could benefit from expert support and to aid in the diagnosis of software and hardware issues. TACC Stats continuously collects and analyzes resource usage data for every job run on a system and differs significantly from conventional profilers because it requires no action on the part of the user or consultants -- it is always collecting data on every node for every job. TACC Stats is open source and downloadable, configurable and compatible with general Linux-based computing platforms, and extensible to new CPU architectures and hardware devices. It is meant to provide a comprehensive resource usage monitoring solution. In addition to describing TACC Stats, the paper illustrates its application to identifying production jobs which have inefficient resource use characteristics.

References

YearCitations

Page 1