Inside the Azure AI cloud data centers of today

At the recent Build event, Azure CTO Mark Russinovich discussed the evolving hardware infrastructure of Azure, with a focus on its AI platforms. Over the years the hardware has grown more sophisticated, moving from standard server designs to a variety of server types that include GPUs and other AI accelerators. By 2023, the rapid growth of AI models had driven the development of distributed supercomputers to train those models efficiently.
Microsoft unveiled its first large AI training supercomputer in 2020 and has since built more powerful versions with more GPUs; by mid-2024 it plans to operate more than 30 such supercomputers worldwide. Because models like GPT-4 demand high-bandwidth memory, Azure data centers need large numbers of GPUs with the right characteristics to process large volumes of data quickly.
Beyond training, Azure has developed its Maia hardware for inference, equipped with a direct liquid cooling system for efficiency. The POLCA project manages power consumption across multiple inference operations, allowing more servers to fit within a data center's power budget. To manage data distribution across supercomputers, the Storage Accelerator speeds up data loading when training models.
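The power-budget idea can be illustrated with a small sketch. This is a hypothetical toy model, not Microsoft's POLCA implementation: the premise is that inference servers rarely draw their worst-case power all at once, so capping each server's draw lets more servers share one rack's budget than peak provisioning would allow. The function name, server fields, and thresholds are illustrative assumptions.

```python
def plan_power_caps(servers, rack_budget_watts):
    """Assign per-server power caps so the rack stays within budget.

    Each server dict carries 'peak_watts' (worst-case draw) and
    'typical_watts' (observed steady-state draw). Caps are scaled down
    proportionally, but never below typical draw, which would throttle
    steady-state throughput.
    """
    total_peak = sum(s["peak_watts"] for s in servers)
    if total_peak <= rack_budget_watts:
        # No oversubscription needed: every server keeps full power.
        return {s["name"]: s["peak_watts"] for s in servers}
    scale = rack_budget_watts / total_peak
    return {
        s["name"]: max(s["typical_watts"], s["peak_watts"] * scale)
        for s in servers
    }

# Three 1 kW servers sharing a 2.4 kW budget: each gets an 800 W cap,
# still above its typical draw, so inference throughput is unaffected.
servers = [
    {"name": "srv-0", "peak_watts": 1000, "typical_watts": 600},
    {"name": "srv-1", "peak_watts": 1000, "typical_watts": 550},
    {"name": "srv-2", "peak_watts": 1000, "typical_watts": 580},
]
caps = plan_power_caps(servers, rack_budget_watts=2400)
print(caps)
```

With peak provisioning only two of these servers would fit in the budget; capping admits all three, which is the efficiency the article attributes to this class of technique.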
High-bandwidth networks are crucial for data-parallel workloads, which led Microsoft to invest in InfiniBand connections both for OpenAI's supercomputers and for its own customers. Project Forge, similar in role to Kubernetes, schedules workloads and manages resource allocation. Project Flywheel delivers consistent performance for AI models running on virtual GPUs, while confidential computing capabilities strengthen data privacy and security during training.
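To see why data-parallel training is so network-hungry, consider what each step requires: every worker computes gradients on its own shard of the batch, and all workers must then average those gradients before updating weights. Production systems do this with ring or tree all-reduce over InfiniBand; the in-process sketch below only illustrates the arithmetic, with hypothetical numbers.

```python
def all_reduce_mean(worker_grads):
    """Element-wise average of per-worker gradient vectors.

    This is the result an all-reduce collective produces on every
    worker; in a real cluster the vectors traverse the network, so
    per-step traffic scales with model size -- hence the need for
    high-bandwidth interconnects like InfiniBand.
    """
    n_workers = len(worker_grads)
    return [sum(g) / n_workers for g in zip(*worker_grads)]

# Two workers, each holding gradients from its own data shard.
grads = [
    [0.2, -0.4, 0.1],  # worker 0
    [0.4, -0.2, 0.3],  # worker 1
]
avg = all_reduce_mean(grads)
print(avg)
```

Because every parameter's gradient crosses the network each step, a model with billions of parameters moves gigabytes per iteration, which is why interconnect bandwidth, not just GPU count, bounds training throughput.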
Overall, Microsoft’s investment in AI infrastructure aims to support training and serving large models efficiently. The ongoing hardware and software developments provide a secure, reliable environment for AI applications on Azure’s platform, offering valuable tools and techniques to users whether or not they need supercomputers like OpenAI’s.

Article Source
https://www.infoworld.com/article/2337687/inside-todays-azure-ai-cloud-data-centers.html/amp/