Alibaba Cloud engineer Ennan Zhai has shared research, published via GitHub, on the design of the data centers Alibaba uses for LLM training. The document describes how Alibaba used Ethernet to let 15,000 GPUs communicate, developing its own High Performance Network (HPN) to overcome issues with uneven traffic distribution. Each host in the data center contains eight GPUs and is connected to two different top-of-rack (ToR) switches, so that a single switch failure cannot take the host offline. While Alibaba Cloud does not use NVLink for inter-host communication, it still relies on Nvidia's technology for intra-host networking. For inter-host traffic, the company uses 51.2 Tb/sec single-chip Ethernet ToR switches, which run hot enough to require a novel vapor chamber heat sink to prevent overheating.

The research will be presented at the SIGCOMM conference in August and has drawn interest from companies such as AMD, Intel, Google, and Microsoft. HPN has drawbacks, notably a complicated cabling structure, but it is more affordable than NVLink. The technology has been in production for over eight months and could potentially save institutions money on installation costs.
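The dual-ToR idea described above can be illustrated with a minimal sketch (the class and switch names here are hypothetical illustrations, not taken from Alibaba's paper): a host with eight GPUs keeps two uplinks and stays reachable as long as at least one of its ToR switches is up.

```python
from dataclasses import dataclass, field

@dataclass
class ToRSwitch:
    """A top-of-rack switch that can be up or down."""
    name: str
    up: bool = True

@dataclass
class Host:
    """A training host: eight GPUs, uplinked to two ToR switches."""
    gpus: int = 8
    tors: list = field(default_factory=list)

    def reachable(self) -> bool:
        # The host stays on the network as long as any ToR uplink is up,
        # which is the failure-tolerance property the dual-ToR design buys.
        return any(t.up for t in self.tors)

# Wire one host to two ToR switches (names are illustrative).
tor_a, tor_b = ToRSwitch("tor-a"), ToRSwitch("tor-b")
host = Host(tors=[tor_a, tor_b])

print(host.reachable())  # True: both uplinks healthy
tor_a.up = False
print(host.reachable())  # True: a single-ToR failure is tolerated
tor_b.up = False
print(host.reachable())  # False: only now is the host cut off
```

The point of the sketch is simply that redundancy is per-host: losing one switch degrades capacity but does not disconnect any GPU from the fabric.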
Article Source
https://www.tomshardware.com/tech-industry/alibaba-cloud-ditches-nvidias-interconnect-in-favor-of-ethernet-tech-giant-uses-own-high-performance-network-to-connect-15000-gpus-inside-data-center