Supporting the hypercube programming model on mesh architectures

Abstract

Many combinatorial problems have simple solutions for parallel processing on highly-connected networks such as the butterfly or the hypercube, whereas the fastest processor-to-processor interconnections are realized in parallel machines with low dimensional mesh or torus topology. This paper presents a method for mapping binary hypercubealgorithms onto lower dimensional meshes and analyzes this method in a model derived from the architecture of modern mesh machines. We outline the criteria used to evaluate graph embeddings for mapping supercomputer communication networks. Our work was motivated by the need for fast library routines to do parallel sorting, fast Fouriertransformation and processor synchronization. During the design effort of these building blocks, we developed and analyzed a new technique to support a hypercube network embedded onto a two dimensional torus. A direct implementation of the embedding is made possible by logical channels and pathways. A fast merge sorter based on the bitonic network serves as an example to show how a simple hypercube algorithm can outperform most of the asymptotically optimal mesh algorithms for practical machine sizes. In the conventional mesh computation model, processors are allowed to exchange one unit of data with a neighbor in each step. This model needs to be refined since modern mesh computers, such as the iWarp system, have hardware support for fast non-neighbor communication. The bitonic merge sort, a simple hypercube algorithm, contains a fair amount of fine grain parallelism not found in standard mesh algorithms. This form of parallelism includes pipelined communication, computation overlapped with communication, use of wide instruction words and operands directly read from the communication system through systolic gates. The measured sorting rate of more than 2 10 keys/sec on an iWarp torus with just 64 processors shows the excellent absolute performance of our approach. The performance results compare The research was supported in part by the Defense Advanced Research Project Agency, US Department of Defense, monitored by the Space and Naval Warfare Systems Command under Contract N00039-87-C-0251,and in part by the Office of Naval Research under Contracts N00014-87-K-0385 and N00014-87-K-0533. well with much larger parallel computers. In our analysis of the relative performance we compare our approach to different sorting methods on meshes. The mapped hypercube algorithm is shown to be best for a wide range of machine and problem sizes. For the readers mainly interested in complexity results, our approach may seem somewhat surprising, but the analysis of the algorithm in an accurate model for the iWarp machine shows how good speed and good parallel efficiency is obtained from both forms of parallelism, large and fine grain.

References

Page 1

	Year	Citations
Introduction to Algorithms V. J. Rayward‐Smith, Thomas H. Cormen, Charles E. Leiserson, Journal of the Operational Research Society	1991	16.9K
Sorting networks and their applications Kenneth E. Batcher I/o DevicesMemory DesignEngineeringComputer ArchitectureNetwork Analysis	1968	2.4K
An 0(n log n) sorting network Miklós Ajtai, János Komlós, Endre Szemerédi Circuit ComplexityDirected GraphEngineeringGraph TheorySorting Network	1983	665
Sorting on a mesh-connected parallel computer Clark D. Thompson, H. T. Kung Communications of the ACM Cluster ComputingEngineeringComputer ArchitectureAnalysis Of AlgorithmComputational Complexity	1977	474
iWarp: an integrated solution to high-speed parallel computing Shekhar Borkar, Robert Cohn, George W. Cox, Conference on High Performance Computing (Supercomputing) Cluster ComputingHigh Speed SignalEngineeringComputer ArchitectureSupercomputer Architecture	1988	315
The MasPar MP-1 architecture Tom Blank EngineeringComputer ArchitectureSystem-level DesignSupercomputer ArchitectureProcessor Architecture	2002	280
An optimal sorting algorithm for mesh connected computers C. P. Schnorr, Adi Shamir	1986	199
Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms Sanguthevar Rajasekaran, John H. Reif SIAM Journal on Computing EngineeringComputer ArchitectureParallel ImplementationComputational ComplexityParallel Algorithms	1989	172
Supporting systolic and memory communication in iWarp Shekhar Borkar, Robert Cohn, George W. Cox, ACM SIGARCH Computer Architecture News EngineeringShared MemoryMessage PassingIwarp ArchitectureComputer Engineering	1990	170
Optimal Sorting Algorithms for Parallel Computers Baudet, Stevenson IEEE Transactions on Computers N ElementsCluster ComputingEngineeringComputer ArchitectureParallel Implementation	1978	157

Page 1