Alibaba Cloud opts for Ethernet over Nvidia’s interconnect, using its own High Performance Network to connect 15,000 GPUs in a data center

Alibaba Cloud engineer Ennan Zhai has shared, via GitHub, research on the design of the data centers the company uses for LLM training. The document describes how Alibaba Cloud uses Ethernet to let 15,000 GPUs communicate with one another. The company developed its High Performance Network (HPN) to overcome problems with uneven traffic distribution. The data centers are organized into hosts of eight GPUs each, and each host is connected to two different top-of-rack (ToR) switches so that the failure of a single switch does not cut it off. Although Alibaba Cloud does not use NVLink for inter-host communication, it still uses Nvidia’s technology for networking within a host. The ToR switches are 51.2 Tb/sec single-chip Ethernet switches, which require a novel vapor chamber heat sink to prevent overheating. The research will be presented at the SIGCOMM conference in August, with interest from companies such as AMD, Intel, Google, and Microsoft. HPN has some drawbacks, including a complicated cabling structure, but it is more affordable than NVLink. The technology has been tested for more than eight months and could save institutions money on installation costs.
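To make the host layout concrete, here is a minimal Python sketch of the arrangement described above: eight GPUs per host, and every host attached to two ToR switches. It is an illustration only, not Alibaba Cloud's implementation. The names `Host`, `hosts_needed`, `tor-a`, and `tor-b` are hypothetical; the only figures taken from the article are 15,000 GPUs and eight GPUs per host.

```python
from dataclasses import dataclass

GPUS_PER_HOST = 8      # figure from the article
TOTAL_GPUS = 15_000    # figure from the article


@dataclass(frozen=True)
class Host:
    """Hypothetical model of one host wired to two ToR switches."""
    name: str
    tor_switches: tuple[str, str]  # two distinct ToR switches (dual-homing)

    def reachable(self, failed: set[str]) -> bool:
        # The host stays on the network while at least one of its ToRs is up.
        return any(tor not in failed for tor in self.tor_switches)


def hosts_needed(total_gpus: int = TOTAL_GPUS, per_host: int = GPUS_PER_HOST) -> int:
    # Ceiling division: a partially filled host still counts as a host.
    return -(-total_gpus // per_host)


if __name__ == "__main__":
    host = Host("host-0", ("tor-a", "tor-b"))
    print(hosts_needed())                      # 1875
    print(host.reachable({"tor-a"}))           # True
    print(host.reachable({"tor-a", "tor-b"}))  # False
```

At eight GPUs per host, 15,000 GPUs work out to 1,875 hosts. In this simplified model a host loses connectivity only when both of its ToR switches fail at once, which is the point of connecting each host to two of them.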

Article Source
https://www.tomshardware.com/tech-industry/alibaba-cloud-ditches-nvidias-interconnect-in-favor-of-ethernet-tech-giant-uses-own-high-performance-network-to-connect-15000-gpus-inside-data-center
