DeepSpeed - Concepedia

TLDR

DeepSpeed is a PyTorch-compatible library that advances large-model training by improving scale, speed, cost, and usability, making it possible to train 100-billion-parameter models. Its ZeRO optimizer reduces the resources needed for model and data parallelism, and its transformer kernel advancements led to the world's fastest BERT pretraining record. These breakthroughs enabled the creation of Turing-NLG, a 17-billion-parameter language model that was the largest publicly known at its release.

Abstract

Explore new techniques in Microsoft's open-source library DeepSpeed, which advances large-model training by improving scale, speed, cost, and usability, unlocking the ability to train 100-billion-parameter models. DeepSpeed is compatible with PyTorch. One piece of our library, ZeRO, is a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers have used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), which at the time of its release was the largest publicly known language model, at 17 billion parameters. We will also go over our latest transformer kernel advancements, which led the DeepSpeed team to achieve the world's fastest BERT pretraining record.
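
To give a rough sense of how ZeRO "greatly reduces the resources needed", the sketch below estimates per-GPU memory for model states only. It is an illustration, not DeepSpeed code. The byte counts (2 for fp16 parameters, 2 for fp16 gradients, 12 for Adam optimizer states in mixed precision) and the three partitioning stages follow the accounting in the ZeRO paper rather than anything stated in this abstract. The function name, the 7.5-billion-parameter model, and the 64-GPU setup are assumptions chosen for the example. Activations, temporary buffers, and fragmentation are ignored.

```python
# Back-of-the-envelope estimate of model-state memory per GPU.
# Assumption: mixed-precision Adam, with byte counts taken from the ZeRO paper.
BYTES_FP16_PARAMS = 2
BYTES_FP16_GRADS = 2
BYTES_OPTIMIZER_STATES = 12  # fp32 parameter copy + momentum + variance


def model_state_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Return bytes of model state held by each GPU (hypothetical helper).

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = BYTES_FP16_PARAMS * num_params
    grads = BYTES_FP16_GRADS * num_params
    optimizer = BYTES_OPTIMIZER_STATES * num_params

    # Each stage partitions one more component evenly across the GPUs.
    if stage >= 1:
        optimizer /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus

    return params + grads + optimizer


if __name__ == "__main__":
    # Example values only: 7.5 billion parameters on 64 GPUs.
    for stage in range(4):
        gigabytes = model_state_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gigabytes:.1f} GB per GPU")
```

Under these assumptions, the replicated baseline needs about 120 GB per GPU for model states, while partitioning all three components brings that to under 2 GB, which is why the same hardware can hold far more parameters.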
