This blog was written in collaboration with Microsoft’s DeepSpeed team, Azure ML team, and Azure HPC team.
Large transformer-based deep learning models trained on large amounts of data have shown excellent results on several cognitive tasks in recent years and are behind new products and features that augment human capabilities. These models have grown by several orders of magnitude over the past five years, from the few million parameters of the original transformer models to the latest Megatron-Turing model (MT-NLG) with 530 billion parameters, as shown in Figure 1. There is a growing need for customers to train and optimize large models at an unprecedented rate.
Figure 1: Landscape of large models and hardware capabilities.
Azure Machine Learning (AzureML) brings large fleets of the latest GPUs, interconnected with InfiniBand, to handle large-scale AI training. We are already training some of the largest models, including Megatron/Turing and GPT-3, on Azure. Previously, to train these models, users had to set up and maintain a complex distributed training infrastructure that typically required multiple manual and error-prone steps. This resulted in a sub-par experience in terms of both usability and performance.
Today we are proud to announce a breakthrough in our software stack that uses DeepSpeed and 1024 A100 GPUs to scale training to a 2-trillion-parameter model with an optimized user experience at 1K+ GPU scale. We bring these software innovations to you via AzureML (including a fully optimized PyTorch environment), which offers excellent performance and an easy-to-use interface for large-scale training.
Customers can now use DeepSpeed on Azure with easy-to-use training pipelines, via either the recommended AzureML recipes or bash scripts for VMSS-based environments. As shown in Figure 2, Microsoft takes a full-stack optimization approach, where all the required pieces, including the hardware, the OS, the VM image, the Docker image (which includes optimized PyTorch, DeepSpeed, ONNX Runtime, and other Python packages), and the user-facing AzureML APIs, are optimized, integrated, and extensively tested for excellent performance and scalability without unnecessary complexity.
Figure 2: Microsoft full-stack optimizations for scalable distributed training on Azure.
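For illustration, the ZeRO stage-3 CPU-offloading setup that DeepSpeed supports is driven by a small JSON configuration. The sketch below builds one in Python using DeepSpeed's documented configuration keys; the batch size and precision values are placeholders for illustration, not the exact settings used in our runs:

```python
import json

# Illustrative DeepSpeed configuration enabling ZeRO stage 3 with CPU offloading.
# The values below are placeholders, not the settings used in the experiments.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, and optimizer states
        "offload_optimizer": {"device": "cpu"},  # keep optimizer states in host memory
        "offload_param": {"device": "cpu"},      # offload parameters to host memory when idle
        "overlap_comm": True,                    # overlap communication with computation
    },
}

# Write the config so a launcher can pass it via --deepspeed_config ds_config.json
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

A file like this is what the training launcher hands to DeepSpeed at initialization; the AzureML recipes referenced below package an equivalent configuration for you.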
This optimized stack allowed us to efficiently scale large model training with DeepSpeed on Azure. We are happy to share performance results demonstrating 2x larger model sizes (2 trillion vs. 1 trillion parameters), scaling to 2x more GPUs (1024 vs. 512), and up to 1.8x higher compute throughput per GPU (150 TFLOPs vs. 81 TFLOPs) compared to results published on other cloud providers.
We offer near-linear scalability along two dimensions: increasing the model size and increasing the number of GPUs. As shown in Figure 3a, with DeepSpeed ZeRO-3, its novel CPU offloading capabilities, and a high-performance Azure stack powered by InfiniBand interconnects and A100 GPUs, we were able to sustain throughput per GPU above 157 TFLOPs as the model size grew from 175 billion to 2 trillion parameters. Likewise, for a fixed model size, e.g. 175B, we achieved nearly linear scaling as we increased the number of GPUs from 128 to 1024, as shown in Figure 3b. The key takeaway from the results presented in this blog is that Azure and DeepSpeed together break the GPU memory wall and enable our customers to easily and efficiently train trillion-parameter models at scale.
Figure 3: (a) Near perfect throughput/GPU when increasing the model size from 175 billion to 2 trillion parameters (BS/GPU=8), (b) Nearly perfect performance scaling with increasing number of GPU devices for the 175B model (BS/GPU=16). The sequence length is 1024 in both cases.
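To see why ZeRO-3's partitioning and CPU offloading matter at this scale, a back-of-the-envelope calculation helps. Assuming the commonly cited ~16 bytes of model state per parameter for mixed-precision Adam training (an assumption for illustration, not a measured figure from our runs), a 2-trillion-parameter model carries roughly 32 TB of model state, far beyond any single 80 GB A100; ZeRO-3 shards that state across all GPUs:

```python
# Back-of-the-envelope estimate of model-state memory for a 2T-parameter model.
# Assumption: ~16 bytes of model state per parameter for mixed-precision Adam
# (fp16 weights and gradients plus fp32 master weights, momentum, and variance).
# Activations and temporary buffers are ignored.
params = 2 * 10**12    # 2 trillion parameters
bytes_per_param = 16   # illustrative rule of thumb, not a measured value
gpus = 1024

total_tb = params * bytes_per_param / 1e12          # total model states, in TB
per_gpu_gb = params * bytes_per_param / gpus / 1e9  # ZeRO-3 shards state across GPUs

print(f"Total model states: {total_tb:.0f} TB")           # prints 32 TB
print(f"Per-GPU share with ZeRO-3: {per_gpu_gb:.2f} GB")  # prints 31.25 GB
```

Even sharded, the per-GPU share plus activations and buffers leaves little headroom, which is where CPU offloading of parameters and optimizer states comes in.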
For more information on the optimizations, technologies and detailed performance trends presented above, read our Extended Technical blog.
- Learn more about DeepSpeed, which is part of Microsoft's AI at Scale initiative.
- Learn more about Azure HPC + AI.
- To get started with DeepSpeed on Azure, please follow our Getting started tutorial.
- The results presented in this blog were produced on Azure following the recipes and scripts published in the Megatron-DeepSpeed repository. The recommended and easiest way to run the training experiments is to use the AzureML recipe.
- If you are running experiments in a custom environment built with Azure VMs or VMSS, please refer to the bash scripts we provide in the Megatron-DeepSpeed repository.