NeutronStar: Distributed GNN Training with Hybrid Dependency Management

Abstract

GNN's training needs to resolve issues of vertex dependencies, i.e., each vertex representation's update depends on its neighbors. Existing distributed GNN systems adopt either a dependencies-cached approach or a dependencies-communicated approach. Having made intensive experiments and analysis, we find that a decision to choose one or the other approach for the best performance is determined by a set of factors, including graph inputs, model configurations, and an underlying computing cluster environment. If various GNN trainings are supported solely by one approach, the performance results are often suboptimal. We study related factors for each GNN training before its execution to choose the best-fit approach accordingly. We propose a hybrid dependency-handling approach that adaptively takes the merits of the two approaches at runtime. Based on the hybrid approach, we further develop a distributed GNN training system called NeutronStar, which makes high performance GNN trainings in an automatic way. NeutronStar is also empowered by effective optimizations in CPU-GPU computation and data processing. Our experimental results on 16-node Aliyun cluster demonstrate that NeutronStar achieves 1.81X-14.25X speedup over existing GNN systems including DistDGL and ROC.

References

Page 1

	Year	Citations

Page 1