Anytime MiniBatch: Exploiting Stragglers in Online Distributed\n Optimization

Abstract

Distributed optimization is vital in solving large-scale machine learning\nproblems. A widely-shared feature of distributed optimization techniques is the\nrequirement that all nodes complete their assigned tasks in each computational\nepoch before the system can proceed to the next epoch. In such settings, slow\nnodes, called stragglers, can greatly slow progress. To mitigate the impact of\nstragglers, we propose an online distributed optimization method called Anytime\nMinibatch. In this approach, all nodes are given a fixed time to compute the\ngradients of as many data samples as possible. The result is a variable\nper-node minibatch size. Workers then get a fixed communication time to average\ntheir minibatch gradients via several rounds of consensus, which are then used\nto update primal variables via dual averaging. Anytime Minibatch prevents\nstragglers from holding up the system without wasting the work that stragglers\ncan complete. We present a convergence analysis and analyze the wall time\nperformance. Our numerical results show that our approach is up to 1.5 times\nfaster in Amazon EC2 and it is up to five times faster when there is greater\nvariability in compute node performance.\n