DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation

Abstract

To improve the resilience of distributed training to worst-case, or Byzantine, node failures, several recent approaches have replaced gradient averaging with robust aggregation methods. Such techniques can have high computational costs, often quadratic in the number of compute nodes, and offer only limited robustness guarantees. Other methods have instead used redundancy to guarantee robustness, but can tolerate only a limited number of Byzantine failures. In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation. DETOX operates in two steps: a filtering step that uses limited redundancy to significantly reduce the effect of Byzantine nodes, and a hierarchical aggregation step that can be used in tandem with any state-of-the-art robust aggregation method. We show theoretically that this leads to a substantial increase in robustness and to a per-iteration runtime that can be nearly linear in the number of compute nodes. We provide extensive experiments over real distributed setups across a variety of large-scale machine learning tasks, showing that DETOX leads to orders-of-magnitude improvements in accuracy and speed over many state-of-the-art Byzantine-resilient approaches.
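
The two steps named in the abstract can be illustrated with a minimal, single-process sketch. Everything specific here is an assumption made for illustration and is not stated in the abstract: that workers are split into redundancy groups of size `r` which compute the same gradient, that the filtering step is a per-group majority vote, that the hierarchical step averages the voted gradients in buckets of `bucket_size`, and that the final robust aggregator is a coordinate-wise median. The function names (`majority_vote`, `detox_style_aggregate`) are hypothetical.

```python
from collections import Counter
from statistics import median


def majority_vote(replicas):
    """Return the gradient reported most often within one redundancy group."""
    counts = Counter(tuple(g) for g in replicas)
    winner, _ = counts.most_common(1)[0]
    return list(winner)


def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def coordinate_median(vectors):
    """Coordinate-wise median: a stand-in for any robust aggregator."""
    return [median(col) for col in zip(*vectors)]


def detox_style_aggregate(worker_grads, r, bucket_size,
                          robust_agg=coordinate_median):
    # Step 1 (filtering, assumed to be a majority vote): consecutive runs of
    # r workers are treated as replicas of the same gradient computation.
    groups = [worker_grads[i:i + r] for i in range(0, len(worker_grads), r)]
    voted = [majority_vote(g) for g in groups]

    # Step 2 (hierarchical aggregation, assumed bucketing): average the voted
    # gradients within buckets, then robustly aggregate the bucket averages.
    buckets = [voted[i:i + bucket_size]
               for i in range(0, len(voted), bucket_size)]
    bucket_means = [mean(b) for b in buckets]
    return robust_agg(bucket_means)


# Nine workers in three groups of three; the second worker is Byzantine.
grads = [
    [1.0, 2.0], [100.0, -100.0], [1.0, 2.0],
    [3.0, 4.0], [3.0, 4.0], [3.0, 4.0],
    [5.0, 6.0], [5.0, 6.0], [5.0, 6.0],
]
result = detox_style_aggregate(grads, r=3, bucket_size=1)
print(result)  # [3.0, 4.0]
```

In this toy run the corrupted gradient is outvoted inside its group, so it never reaches the robust aggregator. The aggregator also sees only one vector per bucket rather than one per worker, which is the intuition behind the runtime claim: an aggregator whose cost is quadratic in its number of inputs is applied to a much smaller set.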