Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

Abstract

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, fair sharing across users, etc. However, Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling, which reduces the flexibility of scheduling and makes the jobs themselves inelastic to failures at runtime. In this paper, we present a detailed workload characterization of a two-month-long trace from a multi-tenant GPU cluster in a large enterprise. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.