Publication | Closed Access
Supporting the hypercube programming model on mesh architectures
16
Citations
16
References
1992
Year
Unknown Venue
Cluster ComputingEngineeringComputer ArchitectureNetwork AnalysisComputer-aided DesignInterconnection Network ArchitectureHigh Performance ComputingSupercomputer ArchitectureMesh OptimizationIwarp TorusModeling And SimulationParallel ComputingComputational GeometryMesh ArchitecturesGeometric ModelingMassively-parallel ComputingComputer EngineeringComputer ScienceUnstructured Mesh GenerationTorus TopologyGraph TheoryEdge ComputingNatural SciencesParallel ProcessingMesh ReductionParallel ProgrammingDimensional Torus
Many combinatorial problems have simple solutions for parallel processing on highly-connected networks such as the butterfly or the hypercube, whereas the fastest processor-to-processor interconnections are realized in parallel machines with low dimensional mesh or torus topology. This paper presents a method for mapping binary hypercubealgorithms onto lower dimensional meshes and analyzes this method in a model derived from the architecture of modern mesh machines. We outline the criteria used to evaluate graph embeddings for mapping supercomputer communication networks. Our work was motivated by the need for fast library routines to do parallel sorting, fast Fouriertransformation and processor synchronization. During the design effort of these building blocks, we developed and analyzed a new technique to support a hypercube network embedded onto a two dimensional torus. A direct implementation of the embedding is made possible by logical channels and pathways. A fast merge sorter based on the bitonic network serves as an example to show how a simple hypercube algorithm can outperform most of the asymptotically optimal mesh algorithms for practical machine sizes. In the conventional mesh computation model, processors are allowed to exchange one unit of data with a neighbor in each step. This model needs to be refined since modern mesh computers, such as the iWarp system, have hardware support for fast non-neighbor communication. The bitonic merge sort, a simple hypercube algorithm, contains a fair amount of fine grain parallelism not found in standard mesh algorithms. This form of parallelism includes pipelined communication, computation overlapped with communication, use of wide instruction words and operands directly read from the communication system through systolic gates. The measured sorting rate of more than 2 10 keys/sec on an iWarp torus with just 64 processors shows the excellent absolute performance of our approach. The performance results compare The research was supported in part by the Defense Advanced Research Project Agency, US Department of Defense, monitored by the Space and Naval Warfare Systems Command under Contract N00039-87-C-0251,and in part by the Office of Naval Research under Contracts N00014-87-K-0385 and N00014-87-K-0533. well with much larger parallel computers. In our analysis of the relative performance we compare our approach to different sorting methods on meshes. The mapped hypercube algorithm is shown to be best for a wide range of machine and problem sizes. For the readers mainly interested in complexity results, our approach may seem somewhat surprising, but the analysis of the algorithm in an accurate model for the iWarp machine shows how good speed and good parallel efficiency is obtained from both forms of parallelism, large and fine grain.
| Year | Citations | |
|---|---|---|
1991 | 16.9K | |
1968 | 2.4K | |
1983 | 665 | |
1977 | 474 | |
1988 | 315 | |
2002 | 280 | |
1986 | 199 | |
1989 | 172 | |
1990 | 170 | |
1978 | 157 |
Page 1
Page 1