Generative AI models require vast amounts of data to train, and training times lengthen as the models grow in complexity. Longer training not only increases operating costs but also slows innovation. Traditional networks cannot provide the low latency and large scale that generative AI model training requires.
To address this issue, we have taken a unique approach: we build our own network devices and network operating systems for every layer of the network stack. This level of control allows us to improve security, reliability, and performance for our customers, while also enabling faster innovation. One example is the Elastic Fabric Adapter (EFA), a custom-built network interface designed by AWS in 2019 that lets Amazon EC2 instances run applications requiring high levels of inter-node communication at scale. EFA uses the Scalable Reliable Datagram (SRD) protocol.
In 2020, we introduced the UltraCluster network, which supported 4,000 GPUs with a latency of eight microseconds between servers. Building on this success, we developed the UltraCluster 2.0 network in just seven months to support over 20,000 GPUs with a 25% reduction in latency. Known internally as the “10p10u” network, it delivers tens of petabits per second of throughput with a round-trip time of less than 10 microseconds, resulting in at least a 15% reduction in model training time.
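To make these figures concrete, the short Python sketch below works through the arithmetic. It assumes the 25% latency reduction is measured against the eight-microsecond figure of the first-generation network, which the text implies but does not state outright. The 30-day baseline training run is purely hypothetical and serves only to illustrate what a 15% reduction would mean.

```python
def apply_reduction(value: float, percent: float) -> float:
    """Return value after reducing it by the given percentage."""
    return value * (1 - percent / 100)


# Figures quoted in the text.
ULTRACLUSTER_GPUS = 4_000
ULTRACLUSTER_LATENCY_US = 8.0
ULTRACLUSTER_2_GPUS = 20_000   # "over 20,000", so this is a lower bound
LATENCY_REDUCTION_PCT = 25
TRAINING_TIME_REDUCTION_PCT = 15  # "at least 15%"

# Assumption: the 25% reduction applies to the 8 microsecond baseline.
implied_latency_us = apply_reduction(ULTRACLUSTER_LATENCY_US, LATENCY_REDUCTION_PCT)

# Minimum growth in supported GPUs between the two generations.
gpu_scale_factor = ULTRACLUSTER_2_GPUS / ULTRACLUSTER_GPUS

# Hypothetical baseline, not from the article.
baseline_training_days = 30.0
reduced_training_days = apply_reduction(baseline_training_days, TRAINING_TIME_REDUCTION_PCT)

print(f"Implied UltraCluster 2.0 latency: {implied_latency_us:.1f} microseconds")
print(f"GPU scale factor: at least {gpu_scale_factor:.0f}x")
print(f"A {baseline_training_days:.0f}-day run would take at most "
      f"{reduced_training_days:.1f} days")
```

Under these assumptions the implied server-to-server latency is about six microseconds, and the network supports at least five times as many GPUs as its predecessor.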
Our commitment to investing in our own custom network devices and software has allowed us to quickly deliver cutting-edge network solutions for generative AI workloads. By continuously working to reduce network latency and improve performance, we are enabling customers to leverage the full potential of their AI models while staying ahead of the competition in terms of innovation and efficiency.
Article Source
https://www.aboutamazon.com/news/aws/aws-infrastructure-generative-ai