Publication | Closed Access
Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct
Citations: 11
References: 11
Year: 2012
Venue: Unknown
Keywords: Large-scale Global Optimization, Cluster Computing, Engineering, High Performance Computer Network, Distributed Algorithms, Computer Architecture, RDMA Algorithm, Network Management, Parallel Computing, Combinatorial Optimization, Manycore Processor, Computer Engineering, Computer Science, Distributed Processing, Scalable Computing, ConnectX CORE-Direct, Distributed Computing, Edge Computing, Cloud Computing, Collective Operation, Many-core Architecture, Parallel Programming, Distributed Data Store
The all-to-all collective communication operation is used by many scientific applications, and is one of the most time-consuming and challenging collective operations to optimize. Algorithms for all-to-all operations typically fall into two classes, logarithmic-scaling and linear-scaling, with Bruck's algorithm, a logarithmic-scaling algorithm, used in many small-data all-to-all implementations. The recent addition of InfiniBand CORE-Direct support for network management of collective communications offers new opportunities for optimizing all-to-all operations, as well as for supporting truly asynchronous implementations of these operations. This paper presents several new enhancements to the Bruck small-data algorithm that leverage CORE-Direct and other InfiniBand network capabilities to produce efficient implementations of this collective operation. These include the RDMA, SR-RNR, and SR-RTR algorithms. In addition, nonblocking implementations of these collective operations are also presented. Benchmark results show that the RDMA algorithm, which uses CORE-Direct capabilities to offload collective communication management to the Host Channel Adapter (HCA), hardware gather support for sending non-contiguous data, and low-latency RDMA semantics, performs best. For a 64-process all-to-all with 128 bytes per process, the RDMA algorithm performs 27% better than the Bruck's algorithm implementation in Open MPI and 136% better than the SR-RTR algorithm. In addition, the nonblocking versions of these algorithms have the same performance characteristics as the blocking algorithms. Finally, measurements of computation/communication overlap capacity show that all offloaded algorithms achieve about 98% overlap for large-data all-to-all, whereas implementations using host-based progress achieve only about 9.5% overlap.
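For context on the baseline the paper optimizes: Bruck's small-data all-to-all runs in ⌈log2 p⌉ communication rounds instead of p−1 by combining blocks for multiple destinations into each message. Below is a minimal single-process simulation sketch of that structure (local rotation, log-rounds exchange, inverse rotation); it is illustrative only, with simulated "ranks" as list indices rather than MPI processes, and does not reflect the paper's CORE-Direct offload variants.

```python
def bruck_alltoall(send):
    """Simulate Bruck's small-data all-to-all over p simulated ranks.

    send[i][j] is the block rank i sends to rank j.
    Returns recv with recv[i][j] == send[j][i].
    """
    p = len(send)
    # Phase 1: rank i rotates its blocks up by i positions, so the
    # block at position j is destined for rank (i + j) mod p.
    tmp = [[send[i][(j + i) % p] for j in range(p)] for i in range(p)]

    # Phase 2: ceil(log2 p) rounds. In round k, rank i sends every
    # block whose index has bit k set to rank (i + 2^k) mod p and
    # receives the matching blocks from rank (i - 2^k) mod p, so each
    # block advances a total of j positions toward its destination.
    k = 0
    while (1 << k) < p:
        pof2 = 1 << k
        nxt = [row[:] for row in tmp]  # model simultaneous exchange
        for i in range(p):
            src = (i - pof2) % p
            for j in range(p):
                if j & pof2:
                    nxt[i][j] = tmp[src][j]
        tmp = nxt
        k += 1

    # Phase 3: inverse rotation -- on rank i, the block that
    # originated at rank m now sits at position (i - m) mod p.
    return [[tmp[i][(i - m) % p] for m in range(p)] for i in range(p)]
```

Because each rank sends one aggregated message per round, message count drops from O(p) to O(log p) at the cost of forwarding each block through intermediate ranks, which is why this class of algorithm wins for small per-process data.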