Publication | Closed Access
Detailed cache coherence characterization for OpenMP benchmarks
Citations: 11
References: 25
Year: 2004
Unknown Venue
Cluster Computing, Engineering, Computer Architecture, Memory Model (Programming), Software Analysis, Hardware Security, Shared Memory, High-performance Architecture, Coherence Traffic, Parallel Computing, Computer Engineering, Cache Coherence Characterization, Computer Science, Cache Coherence, Program Analysis, Parallel Performance Evaluation, Many-core Architecture, Parallel Programming, Cache Coherence Traffic, Performance Portability, System Software, OpenMP
Past work on cache coherence in shared-memory symmetric multiprocessors (SMPs) concentrates on aggregate events, often from an architectural point of view. However, this approach provides insufficient information about the exact sources of inefficiencies in parallel applications. For SMPs in contemporary clusters, application performance is affected by the pattern of shared-memory usage, and it becomes essential to understand coherence behavior in terms of application program constructs such as data structures and source code lines.

The technical contributions of this work are as follows. We introduce ccSIM, a cache-coherent memory simulator fed by data traces obtained through on-the-fly dynamic binary rewriting of OpenMP benchmarks executing on a Power3 SMP node. We explore the degrees of freedom in interleaving data traces from the different processors and assess the simulation accuracy by comparing with hardware performance counters. The novelty of ccSIM lies in its ability to relate coherence traffic (specifically coherence misses and their progenitor invalidations) to data structures and to their reference locations in the source program, thereby facilitating the detection of inefficiencies.

Our experiments demonstrate that (a) cache coherence traffic is simulated accurately for SPMD programming styles, as the simulated invalidation traffic closely matches the corresponding hardware performance counters; (b) detailed coherence information can be derived that indicates the location of invalidations in the application code, i.e., source lines and data structures; and (c) these details reveal opportunities for optimization. By exploiting these unique features of ccSIM, we were able to identify and locate opportunities for program transformations, including interactions with OpenMP constructs, resulting in both significantly decreased coherence misses and savings of up to 73% in wall-clock execution time for several real-world benchmarks.
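
To make the idea of attributing coherence traffic to program constructs concrete, the following is a minimal sketch and not ccSIM itself. It assumes a hypothetical trace format of `(cpu, op, addr, var, src_line)` records that have already been interleaved, a simplified invalidation-based protocol with infinitely large caches, and a 64-byte cache line. The function name `simulate` and all other identifiers are illustrative assumptions, not part of the original work.

```python
from collections import defaultdict

LINE_SIZE = 64  # assumed cache line size in bytes (illustrative)


def simulate(trace, line_size=LINE_SIZE):
    """Replay an interleaved trace through a simplified invalidation protocol.

    Each trace record is (cpu, op, addr, var, src_line), where op is "R" or "W".
    This record layout is a hypothetical format chosen for this sketch.
    Returns two dicts keyed by (var, src_line):
      invalidations    -- remote copies invalidated by writes at that location
      coherence_misses -- misses caused by an earlier remote invalidation
    """
    holders = defaultdict(set)      # cache line -> CPUs holding a valid copy
    invalidated = defaultdict(set)  # cache line -> CPUs whose copy was invalidated
    invalidations = defaultdict(int)
    coherence_misses = defaultdict(int)

    for cpu, op, addr, var, src_line in trace:
        block = addr // line_size
        if cpu not in holders[block]:
            # A miss counts as a coherence miss only if this CPU lost the
            # line to a remote write (caches are infinite, so no evictions).
            if cpu in invalidated[block]:
                coherence_misses[(var, src_line)] += 1
                invalidated[block].discard(cpu)
            holders[block].add(cpu)
        if op == "W":
            # A write invalidates every other valid copy of the line;
            # the invalidations are charged to the writer's location.
            victims = holders[block] - {cpu}
            if victims:
                invalidations[(var, src_line)] += len(victims)
                invalidated[block] |= victims
                holders[block] = {cpu}
    return dict(invalidations), dict(coherence_misses)


# Two CPUs update adjacent 8-byte elements that share one cache line.
false_sharing = [
    (0, "W", 0, "sum", 10),
    (1, "W", 8, "sum", 10),
    (0, "W", 0, "sum", 10),
    (1, "W", 8, "sum", 10),
]
# The same accesses after padding each element to its own cache line.
padded = [(cpu, op, cpu * LINE_SIZE, var, src) for cpu, op, _, var, src in false_sharing]

print(simulate(false_sharing))
print(simulate(padded))
```

In this toy trace, the unpadded layout charges every invalidation and coherence miss to the pair `("sum", 10)`, while the padded layout produces none. A report keyed by data structure and source line in this way is what points a programmer at a candidate transformation such as padding or privatization.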