Publication | Closed Access
Parallel hierarchical N-body methods and their implications for multiprocessors
Citations: 78
References: 0
Year: 1993
Venue: Unknown
Keywords: Cluster Computing, Massively-parallel Computing, Engineering, Parallel Software, 27-Fold Speedup, Parallel Processing, Computer Architecture, Computer Engineering, Parallel Implementation, Parallel Programming, Computer Science, 118-Fold Speedups, Parallel Computing, Computational Mechanics, Computational Geometry, Data-level Parallelism, Novel Partitioning Technique
To make significant advances in the design of large-scale multiprocessing systems, designers need to understand the characteristics of the applications that will run on these systems. This thesis studies the parallelization and system implications of one important class of scientific applications: N-body applications that use hierarchical solution methods. We focus primarily on the Barnes-Hut and Fast Multipole methods, the two most prominent hierarchical methods for classical N-body problems. In addition, we study a radiosity application from computer graphics, which applies the hierarchical N-body approach to a very different problem domain.

We first address the problem of obtaining effective parallel performance on these applications, which is challenging owing to their nonuniform, dynamically changing nature and their need for long-range communication. For the classical applications, we show that simple yet effective partitioning techniques can be developed by exploiting key insights into the problems and solution methods. Using a novel partitioning technique that we propose, 45-fold speedups are obtained on a 48-processor Stanford DASH machine (a cache-coherent, shared-address-space multiprocessor) and 118-fold speedups on a 128-processor simulated architecture, even for relatively small problems. The radiosity application also yields good parallel performance (27-fold speedup on a 32-processor DASH), but only by resorting to dynamic task stealing.

We then study the implications of this class of applications for multiprocessor architecture. We show that both a shared address space and caching are important architectural properties for achieving high performance. Finally, we examine the architectural implications of scaling the applications to run on machines with more processors.
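To make the hierarchical idea concrete, the following is a minimal 2-D Barnes-Hut sketch, not the thesis's code (which targets the full 3-D methods): bodies are inserted into a quadtree, and a distant cell is approximated by a single point mass when it subtends a small enough angle. The `Node` class, `THETA` value, and `accel` function are illustrative names chosen here; the sketch assumes positive masses at distinct positions inside the unit square.

```python
import math

THETA = 0.5  # opening-angle parameter: smaller means more accuracy, more work

class Node:
    """A square quadtree cell covering [x, x+size) x [y, y+size)."""
    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size
        self.mass = 0.0
        self.cx = self.cy = 0.0    # centre of mass of everything in the cell
        self.body = None           # (x, y, m) if this is a leaf with one body
        self.children = None       # four sub-cells once subdivided

    def insert(self, bx, by, bm):
        if self.mass == 0.0 and self.children is None:
            self.body = (bx, by, bm)        # empty leaf: store the body here
        else:
            if self.children is None:       # occupied leaf: split, push down
                half = self.size / 2
                self.children = [Node(self.x + dx, self.y + dy, half)
                                 for dy in (0, half) for dx in (0, half)]
                ox, oy, om = self.body
                self.body = None
                self._child_for(ox, oy).insert(ox, oy, om)
            self._child_for(bx, by).insert(bx, by, bm)
        total = self.mass + bm              # maintain aggregate mass / centre
        self.cx = (self.cx * self.mass + bx * bm) / total
        self.cy = (self.cy * self.mass + by * bm) / total
        self.mass = total

    def _child_for(self, bx, by):
        half = self.size / 2
        return self.children[(1 if bx >= self.x + half else 0) +
                             (2 if by >= self.y + half else 0)]

def accel(node, px, py, eps=1e-12):
    """Approximate gravitational acceleration at (px, py), in G = 1 units."""
    if node is None or node.mass == 0.0:
        return 0.0, 0.0
    dx, dy = node.cx - px, node.cy - py
    r = math.hypot(dx, dy)
    # The key O(n log n) idea: a distant cell acts as a single point mass.
    if node.children is None or node.size < THETA * r:
        if r < eps:                         # skip self-interaction
            return 0.0, 0.0
        a = node.mass / (r * r)
        return a * dx / r, a * dy / r
    ax = ay = 0.0
    for child in node.children:
        cax, cay = accel(child, px, py, eps)
        ax, ay = ax + cax, ay + cay
    return ax, ay

root = Node(0.0, 0.0, 1.0)
for b in [(0.1, 0.1, 1.0), (0.9, 0.2, 2.0), (0.5, 0.8, 1.5)]:
    root.insert(*b)
print(accel(root, 0.1, 0.1))
```

The nonuniformity and long-range communication discussed above are visible even in this toy: the tree's depth follows the body distribution, and `accel` for one body touches cells spread across the whole domain.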
We find that a scaling methodology which reflects the concerns of the application scientist leads to different conclusions than the more naive scaling methods typically found in the literature. In particular, both the communication-to-computation ratio that a machine must support and the size of a processor's working set (and hence the per-processor cache size required for effective performance) grow only slowly as larger problems are run on larger machines.
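The sensitivity of such conclusions to the scaling model can be illustrated with a toy calculation; this is a generic fixed-time (Gustafson-style) model under assumed O(n log n) total work, not the thesis's own application-driven methodology, and the function name and base problem size `n0` are invented for the example. Under naive memory-constrained scaling, n grows linearly with the processor count p; under time-constrained scaling, n grows more slowly, so per-processor quantities come out differently.

```python
import math

def time_constrained_n(p, n0=10_000):
    """Largest n with n*log2(n) <= p * n0*log2(n0): per-processor runtime
    stays fixed when total O(n log n) work is spread over p processors."""
    target = p * n0 * math.log2(n0)
    lo, hi = 1, n0 * p + 1
    while hi - lo > 1:                 # binary search; n*log2(n) is monotone
        mid = (lo + hi) // 2
        if mid * math.log2(mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

for p in (1, 16, 128):
    naive = 10_000 * p                 # memory-constrained: n proportional to p
    realistic = time_constrained_n(p)  # time-constrained: n grows sublinearly
    print(f"p={p:4d}  naive n={naive:9d}  "
          f"time-constrained n={realistic:9d}  (n/p = {realistic // p})")
```

At p = 128 the time-constrained problem size is well below the naively scaled one, so the per-processor data volume, and hence any conclusion drawn about cache sizes or communication ratios, depends directly on which scaling rule one adopts.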