Publication | Open Access
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Citations: 31
References: 44
Year: 2021
Venue: Unknown
Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large, and practitioners often resort to methods such as distillation for serving. In this work, we investigate routing strategies at different granularities (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.
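
The difference between token-level and task-level routing, and why the latter permits sub-network extraction, can be illustrated with a small sketch. This is not the paper's implementation: the top-2 selection, the gate scores, the task names, and every function name below are hypothetical and chosen only for illustration.

```python
import random

NUM_EXPERTS = 4  # assumed size of the expert pool
TOP_K = 2        # assumed number of experts used per routing decision


def top_k_experts(scores, k=TOP_K):
    """Return the indices of the k highest gate scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def token_level_route(token_gate_scores):
    """One routing decision per token, so any expert may be needed at serving time."""
    return [top_k_experts(scores) for scores in token_gate_scores]


def task_level_route(task_gate_scores, task_id):
    """One routing decision per task, shared by every token of that task."""
    return top_k_experts(task_gate_scores[task_id])


def extract_subnetwork(experts, task_gate_scores, task_id):
    """Keep only the experts a task routes to; the others are not needed to serve it."""
    keep = task_level_route(task_gate_scores, task_id)
    return {i: experts[i] for i in keep}


# Hypothetical expert weights and gate scores (placeholders, not real data).
experts = {i: f"expert_{i}_weights" for i in range(NUM_EXPERTS)}
task_gate_scores = {
    "en-de": [0.10, 0.60, 0.05, 0.25],
    "en-fr": [0.40, 0.10, 0.30, 0.20],
}
rng = random.Random(0)
token_gate_scores = [[rng.random() for _ in range(NUM_EXPERTS)] for _ in range(6)]

per_token = token_level_route(token_gate_scores)
en_de_subnetwork = extract_subnetwork(experts, task_gate_scores, "en-de")
print("experts touched by token routing:", sorted({i for r in per_token for i in r}))
print("experts kept for en-de:", sorted(en_de_subnetwork))
```

Under token-level routing, the set of experts touched depends on the input, so the full expert pool generally has to stay loaded. Under task-level routing, the experts for a task are fixed in advance, which is what makes it possible to serve a task from a smaller slice of the model.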