Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures

Abstract

This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems. We consider both single- and double-precision arithmetic and apply numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our findings, we show that optimization and parallelization can improve double-precision performance by 25× on Intel's quad-core Nehalem, 9.4× on AMD's quad-core Barcelona, and 37.6× on Sun's Victoria Falls (all systems in dual-socket configurations). We also compare our single-precision version against our prior state-of-the-art GPU-based code and show, surprisingly, that the most advanced multicore architecture (Nehalem) reaches parity in both performance and power efficiency with NVIDIA's most advanced GPU architecture.
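As a rough illustration of the kind of data structure transformation the abstract mentions, the sketch below converts particles from an array-of-structures layout into separate coordinate arrays (structure-of-arrays) and evaluates the direct O(N²) sum that the FMM approximates. The abstract does not state the kernel or the layout used, so the Laplace 1/r kernel, the layout choice, and the names `to_soa` and `direct_potentials` are assumptions made here for illustration only; this is not the paper's implementation.

```python
import math
from array import array


def to_soa(points):
    """Split a list of (x, y, z, q) tuples into four contiguous double arrays.

    Hypothetical helper: illustrates an array-of-structures to
    structure-of-arrays conversion, not the paper's actual code.
    """
    xs = array("d", (p[0] for p in points))
    ys = array("d", (p[1] for p in points))
    zs = array("d", (p[2] for p in points))
    qs = array("d", (p[3] for p in points))
    return xs, ys, zs, qs


def direct_potentials(xs, ys, zs, qs):
    """Direct O(N^2) reference sum: phi[i] = sum over j != i of qs[j] / |r_i - r_j|.

    Assumes the Laplace 1/r kernel and distinct particle positions.
    """
    n = len(qs)
    phi = array("d", [0.0] * n)
    for i in range(n):
        xi, yi, zi = xs[i], ys[i], zs[i]
        acc = 0.0
        for j in range(n):
            if j == i:
                continue  # skip the self-interaction
            dx, dy, dz = xi - xs[j], yi - ys[j], zi - zs[j]
            acc += qs[j] / math.sqrt(dx * dx + dy * dy + dz * dz)
        phi[i] = acc
    return phi


# Three particles given as (x, y, z, charge) tuples.
points = [(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 2.0), (0.0, 2.0, 0.0, 4.0)]
xs, ys, zs, qs = to_soa(points)
phi = direct_potentials(xs, ys, zs, qs)
```

In a compiled language, the outer loop over `i` is independent across targets, which is the sort of loop an OpenMP parallel-for would typically be applied to, and the separate coordinate arrays are the form that tends to suit SIMD loads. Whether the paper's optimizations take exactly this shape is not stated in the abstract.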
