Improve Mixtral 8x7B pre-training speed with expert parallelism on Amazon SageMaker | Amazon Web Services

Mixture of Experts (MoE) architectures are gaining popularity for large language models (LLMs) due to their ability to increase model capacity and computational efficiency compared to fully dense models. MoE models utilize sparse expert subnetworks that process different subsets of tokens, allowing for a higher number of parameters with less computation per token during training and inference. This leads to more cost-effective training of larger models within fixed compute budgets compared to dense architectures.

However, efficiently training and fine-tuning large MoE models presents challenges. Load balancing issues can arise if tokens are unevenly distributed across experts during training, leading to some experts being overloaded while others are under-utilized. Additionally, MoE models have high memory requirements as all expert parameters need to be loaded into memory, even though only a subset is used for each input.

To address these challenges, Amazon SageMaker has introduced new features in its model parallelism library that enable efficient training of MoE models using expert parallelism. Expert parallelism splits the experts of an MoE model across separate workers or devices, similar to how tensor parallelism partitions dense model layers. Using this capability, AWS demonstrated pre-training of the 47-billion-parameter Mixtral 8x7B MoE model.
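
To make the placement concrete, the following is a minimal, hypothetical sketch (not an SMP API) of how eight experts could be divided evenly across an expert-parallel group of four GPUs; during training, tokens are then exchanged between devices so that each expert only processes the tokens routed to it.

```python
# Hypothetical helper (not part of SMP) showing how expert parallelism
# assigns whole experts to devices instead of sharding individual layers.
def experts_for_rank(num_experts: int, ep_degree: int, rank: int) -> list[int]:
    """Expert indices owned by `rank` when `num_experts` experts are split
    evenly across an expert-parallel group of `ep_degree` devices."""
    assert num_experts % ep_degree == 0, "experts must divide evenly across devices"
    per_rank = num_experts // ep_degree
    return list(range(rank * per_rank, (rank + 1) * per_rank))

# Mixtral 8x7B has 8 experts; with an expert-parallel degree of 4,
# each GPU in the group holds 2 of them:
for rank in range(4):
    print(f"rank {rank} holds experts {experts_for_rank(8, 4, rank)}")
# rank 0 holds experts [0, 1]
# rank 1 holds experts [2, 3]
# rank 2 holds experts [4, 5]
# rank 3 holds experts [6, 7]
```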

The Mixtral 8x7B model consists of eight expert subnetworks, each with around 7 billion parameters, and a trainable gate network called a “router” that determines which input tokens are sent to which expert. Through expert parallelism, different expert subnetworks are placed on separate devices, with each device handling the computation for the experts it contains. This approach addresses the high memory requirements of loading all experts on a single device and enables MoE training on a larger cluster.
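To illustrate the routing mechanism, here is a simplified, self-contained PyTorch sketch of a sparse MoE block of this kind: a linear router scores every token, the top-2 experts are selected, and each expert is an independent feed-forward subnetwork. The default hidden sizes mirror Mixtral 8x7B, but the expert architecture is deliberately simplified (Mixtral's experts use gated SwiGLU feed-forward layers), so treat this as a conceptual sketch rather than the actual model code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Simplified sparse MoE block: a router picks the top-2 of 8 experts per token."""

    def __init__(self, d_model=4096, d_ff=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Trainable gate ("router") that scores each token against each expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent feed-forward subnetwork
        # (simplified here; Mixtral's experts are gated SwiGLU blocks).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):                              # x: (num_tokens, d_model)
        scores = self.router(x)                        # (num_tokens, num_experts)
        weights, chosen = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # normalize the top-k gate weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e            # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Usage example with small sizes to keep the demo lightweight.
layer = SparseMoELayer(d_model=64, d_ff=256)
tokens = torch.randn(16, 64)
print(layer(tokens).shape)  # torch.Size([16, 64])
```

Each token only activates two of the eight experts, which is why an MoE model can hold many more parameters than it uses per token; expert parallelism then distributes those eight expert subnetworks across devices.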

The SageMaker model parallelism (SMP) library, which uses NVIDIA Megatron to implement expert parallelism, supports training MoE models and runs on top of PyTorch Fully Sharded Data Parallel (FSDP) APIs. When you set the expert_parallel_degree parameter, SMP evenly divides the model's experts across that many GPUs. SMP's expert parallelism is also compatible with sharded data parallelism, further improving memory efficiency and training speed.
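
As a rough sketch of how this could look when launching a training job, the snippet below configures expert parallelism alongside hybrid sharded data parallelism through the SageMaker PyTorch estimator. The distribution keys and parameter names (expert_parallel_degree, hybrid_shard_degree) follow my reading of the SMP v2 documentation, and the instance type, counts, script name, and S3 paths are placeholders, so verify the exact configuration against the official SageMaker documentation before relying on it.

```python
from sagemaker.pytorch import PyTorch

# Illustrative values only; not the configuration used in the benchmark above.
estimator = PyTorch(
    entry_point="train_mixtral.py",           # hypothetical training script
    role="<your-sagemaker-execution-role>",
    instance_type="ml.p4d.24xlarge",          # 8 GPUs per instance
    instance_count=2,
    framework_version="2.2.0",
    py_version="py310",
    distribution={
        "torch_distributed": {"enabled": True},
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "expert_parallel_degree": 8,  # spread the 8 experts over 8 GPUs
                    "hybrid_shard_degree": 2,     # shard remaining parameters across 2 replicas
                },
            }
        },
    },
)

estimator.fit({"train": "s3://<your-bucket>/mixtral-pretraining-data"})
```

Inside the training script itself, as I understand the SMP v2 workflow, the library is initialized with torch.sagemaker.init() and the model is wrapped with PyTorch FSDP as usual; SMP then handles placing each GPU's share of the experts and routing tokens between them.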

In conclusion, leveraging expert parallelism and sharded data parallelism from the Amazon SageMaker model parallelism library can effectively scale MoE architectures across multiple GPUs and workers to train large language models efficiently. These features seamlessly integrate with PyTorch and the Hugging Face Transformers library, providing performance optimizations such as hybrid sharding, delayed parameter initialization, and activation offloading to enhance training efficiency.

Article Source
https://aws.amazon.com/blogs/machine-learning/accelerate-mixtral-8x7b-pre-training-with-expert-parallelism-on-amazon-sagemaker/