Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Abstract

Deep learning (DL) models can take weeks to train on a single GPU-equipped machine, which makes it necessary to scale out DL training to a GPU cluster. However, current distributed DL implementations can scale poorly because of the substantial parameter synchronization they perform over the network: the high throughput of GPUs allows more data batches to be processed per unit time than on CPUs, which leads to more frequent network synchronization. We present Poseidon, an efficient communication architecture for distributed DL on GPUs. Poseidon exploits the layered model structure of DL programs to overlap communication with computation, reducing bursty network communication. Moreover, Poseidon uses a hybrid communication scheme that optimizes the number of bytes required to synchronize each layer, according to the layer's properties and the number of machines. We show that Poseidon is applicable to different DL frameworks by plugging it into Caffe and TensorFlow. Poseidon enables Caffe and TensorFlow to achieve a 15.5x speed-up on 16 single-GPU machines, even with limited bandwidth (10GbE) and the challenging VGG19-22K network for image classification. Moreover, Poseidon-enabled TensorFlow achieves a 31.5x speed-up with 32 single-GPU machines on Inception-V3, a 50% improvement over the open-source TensorFlow (20x speed-up).
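
The idea of overlapping communication with computation can be illustrated with a minimal sketch. It is not Poseidon's implementation or API: the names `backward_with_overlap`, `compute_grad`, and `sync_grad` are hypothetical, and a single background thread stands in for the network layer. The sketch only assumes that backpropagation visits layers from the top of the model to the bottom, so each layer's gradient can be handed off for synchronization while the gradients of lower layers are still being computed.

```python
import queue
import threading


def backward_with_overlap(layer_names, compute_grad, sync_grad):
    """Backpropagate top-down and synchronize each layer as soon as it is ready.

    layer_names  -- layers in forward order (hypothetical names)
    compute_grad -- callable(name) -> gradient; stands in for GPU computation
    sync_grad    -- callable(name, grad); stands in for network synchronization
    Returns the layer names in the order they were synchronized.
    """
    pending = queue.Queue()  # gradients waiting to be sent
    synced = []              # written only by the sender thread

    def sender():
        # Drain the queue until the sentinel (None) arrives.
        while True:
            item = pending.get()
            if item is None:
                break
            name, grad = item
            sync_grad(name, grad)
            synced.append(name)

    worker = threading.Thread(target=sender)
    worker.start()

    # Backward pass: the last layer's gradient is available first.
    for name in reversed(layer_names):
        grad = compute_grad(name)
        # Hand off immediately; computation of the next layer proceeds
        # while the sender thread synchronizes this one.
        pending.put((name, grad))

    pending.put(None)  # signal that no more gradients will follow
    worker.join()
    return synced
```

Because gradients are sent layer by layer rather than in one batch at the end of the backward pass, network traffic in this sketch is spread across the iteration instead of arriving in a single burst, which is the effect the abstract describes.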
