Alibaba Cloud engineer Ennan Zhai has shared research, published via GitHub, on the design of the data centers Alibaba uses for LLM training. The document describes how Alibaba used Ethernet to let 15,000 GPUs communicate, developing its own High Performance Network (HPN) to overcome issues with uneven traffic distribution. Each host in the data center contains eight GPUs and is connected to two different top-of-rack (ToR) switches, so that a single switch failure cannot take the host offline. While Alibaba Cloud does not use NVLink for inter-host communication, it still relies on Nvidia's technology for intra-host networking. For inter-host traffic, the company uses 51.2 Tb/sec single-chip Ethernet ToR switches, which run hot enough to require a novel vapor chamber heat sink to prevent overheating.

The research will be presented at the SIGCOMM conference in August and has drawn interest from companies such as AMD, Intel, Google, and Microsoft. HPN has drawbacks, notably a complicated cabling structure, but it is more affordable than NVLink. The technology has been in production for over eight months and could potentially save institutions money on installation costs.
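The dual-ToR idea described above can be illustrated with a minimal sketch (the class and switch names here are hypothetical illustrations, not taken from Alibaba's paper): a host with eight GPUs keeps two uplinks and stays reachable as long as at least one of its ToR switches is up.

```python
from dataclasses import dataclass, field

@dataclass
class ToRSwitch:
    """A top-of-rack switch that can be up or down."""
    name: str
    up: bool = True

@dataclass
class Host:
    """A training host: eight GPUs, uplinked to two ToR switches."""
    gpus: int = 8
    tors: list = field(default_factory=list)

    def reachable(self) -> bool:
        # The host stays on the network as long as any ToR uplink is up,
        # which is the failure-tolerance property the dual-ToR design buys.
        return any(t.up for t in self.tors)

# Wire one host to two ToR switches (names are illustrative).
tor_a, tor_b = ToRSwitch("tor-a"), ToRSwitch("tor-b")
host = Host(tors=[tor_a, tor_b])

print(host.reachable())  # True: both uplinks healthy
tor_a.up = False
print(host.reachable())  # True: a single-ToR failure is tolerated
tor_b.up = False
print(host.reachable())  # False: only now is the host cut off
```

The point of the sketch is simply that redundancy is per-host: losing one switch degrades capacity but does not disconnect any GPU from the fabric.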
Article Source
https://www.tomshardware.com/tech-industry/alibaba-cloud-ditches-nvidias-interconnect-in-favor-of-ethernet-tech-giant-uses-own-high-performance-network-to-connect-15000-gpus-inside-data-center