Publication | Open Access
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Year: 2008 | Citations: 267 | References: 8 | Venue: unknown
Cluster Computing, Computational Science, Massively-parallel Computing, Engineering, Stencil Computation Optimization, Multicore Architectures, High-performance Architecture, Many-core Architecture, Computer Engineering, Computer Architecture, Multicore Systems, PDE Solvers, Parallel Programming, Computer Science, Parallel Computing, High Performance Computing, Manycore Processor, GPU Computing
Designing and using emerging multicore systems efficiently is one of the hardest problems the mainstream and scientific computing industries have faced in decades. The study develops optimization strategies and an auto-tuning environment that minimizes runtime while improving performance portability for multicore stencil computations. The authors explore nearest-neighbor stencil algorithms, implement a suite of optimization techniques, and use an auto-tuning framework to evaluate them on a wide range of multicore architectures, including Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. The auto-tuning methodology achieves the fastest multicore stencil performance reported at the time of publication and reveals key architectural trade-offs that inform scientific algorithm development.
Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications for scientific algorithm development.
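To make the two central ideas concrete, the following is a minimal sketch of a nearest-neighbor stencil sweep and of an auto-tuner that searches over one optimization parameter. It is not the authors' implementation: the 7-point stencil shape, the coefficients `alpha` and `beta`, the choice of loop blocking as the tuned optimization, and every function name are assumptions made for illustration. Pure Python is used only for readability; interpreter overhead dominates the timings, so the block size it selects says nothing about cache behavior in compiled code.

```python
import itertools
import time


def make_grid(n):
    """Return a flat n*n*n grid filled with deterministic illustrative values."""
    return [float((i * 7 + 3) % 11) for i in range(n * n * n)]


def update(src, c, n, nn, alpha, beta):
    """7-point update: weighted center plus its six nearest neighbors."""
    return alpha * src[c] + beta * (
        src[c - 1] + src[c + 1]        # neighbors along k
        + src[c - n] + src[c + n]      # neighbors along j
        + src[c - nn] + src[c + nn]    # neighbors along i
    )


def sweep_naive(src, n, alpha=0.5, beta=0.1):
    """One sweep over the interior points; boundary values are copied unchanged."""
    dst = list(src)
    nn = n * n
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                c = i * nn + j * n + k
                dst[c] = update(src, c, n, nn, alpha, beta)
    return dst


def sweep_blocked(src, n, bi, bj, alpha=0.5, beta=0.1):
    """Same sweep with the i and j loops tiled into bi x bj blocks."""
    dst = list(src)
    nn = n * n
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, n - 1, bj):
            # min() clips the last tile when the block size does not divide evenly.
            for i in range(i0, min(i0 + bi, n - 1)):
                for j in range(j0, min(j0 + bj, n - 1)):
                    for k in range(1, n - 1):
                        c = i * nn + j * n + k
                        dst[c] = update(src, c, n, nn, alpha, beta)
    return dst


def autotune(n, block_sizes, trials=2):
    """Time every (bi, bj) pair exhaustively; return (fastest pair, all timings)."""
    src = make_grid(n)
    reference = sweep_naive(src, n)
    timings = {}
    for bi, bj in itertools.product(block_sizes, repeat=2):
        best = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            out = sweep_blocked(src, n, bi, bj)
            best = min(best, time.perf_counter() - start)
        # A tuned variant is only acceptable if it reproduces the reference result.
        if out != reference:
            raise ValueError(f"blocked sweep with {(bi, bj)} differs from reference")
        timings[(bi, bj)] = best
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best, timings = autotune(16, [1, 2, 4, 8])
    print("fastest block size in this run:", best)
```

The sketch keeps the search exhaustive and checks every candidate against an untuned reference, which is the simplest way to keep a tuner honest. A realistic tuner would generate and compile code variants per architecture and would search a far larger space of optimizations and parameters than a single pair of block sizes.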