ITWire - Nutanix Unified Storage Takes The Lead In MLPerf Storage V1.0 Benchmark

High-performance storage – A key component in enterprise AI infrastructure
As enterprises adopt AI (including generative AI or genAI), having a fast and efficient data storage system becomes critical. AI workloads are evolving: while many enterprises still focus on training AI models, inference (interacting with and using a model) and tuning (updating an existing model and augmenting it with new data without re-training) are also key considerations when implementing enterprise AI. Regardless of your AI strategy, consider the following if training a model is part of your plan:

Infrastructure requirements and cost

Time-to-business value for a trained AI model

A unified platform to not only train a model but also deploy and use it for enterprise AI applications

If training an AI model is essential to your business, choosing the right environment for the process is key. The public cloud offers a cost-effective option by allowing you to ‘rent’ AI accelerators (GPUs) without a large upfront investment. However, after training, you’ll need to reevaluate if the public cloud is still the best option for inference or tuning.

Imagine having a solution that supports hybrid AI needs, whether on-premises or in the cloud. The Nutanix Unified Storage (NUS) platform is the answer, delivering high-performance storage and a consistent experience to run your AI apps across diverse environments, with a single license.

The results are in
The table below shows the storage performance of Nutanix Unified Storage (NUS) on-premises and in the public cloud (AWS) with an image classification workload (resnet50). We tested two separate NUS cluster configurations: a 32-node cluster on AWS and a 7-node cluster on-premises, both serving file data to simulated Nvidia H100 accelerators.

The results demonstrate the following:

A single NUS cluster can serve 1056 accelerators, the highest of all vendors listed in the benchmark

Performance scales linearly, with the 32-node cluster supporting 4X as many accelerators as the 7-node cluster

Similar performance per node is observed irrespective of the location, on-premises or in the cloud

Name	Workload	Accelerator type	Accelerators achieved
Nutanix Unified Storage	resnet50	h100	1056
YanRong	resnet50	h100	540
DDN	resnet50	h100	512
Lightbits-Labs	resnet50	h100	198
Hammerspace	resnet50	h100	130
Mangoboost	resnet50	h100	128
Weka	resnet50	h100	74
HPE	resnet50	h100	34

A table ranked by ‘accelerators achieved’ result with Nutanix Unified Storage highlighted as a top performer
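
To put the gaps in the resnet50 table above in perspective, the listed results can be expressed relative to the top entry. The following is a minimal Python sketch that uses only the figures from that table; the names `RESNET50_H100` and `relative_to_leader` are illustrative. Note that it compares the submissions as listed and does not normalise for cluster size or hardware.

```python
# Accelerators achieved on resnet50 (simulated H100), copied from the table above.
RESNET50_H100 = {
    "Nutanix Unified Storage": 1056,
    "YanRong": 540,
    "DDN": 512,
    "Lightbits-Labs": 198,
    "Hammerspace": 130,
    "Mangoboost": 128,
    "Weka": 74,
    "HPE": 34,
}


def relative_to_leader(results: dict[str, int]) -> dict[str, float]:
    """Return each entry's result as a fraction of the highest result."""
    top = max(results.values())
    return {name: count / top for name, count in results.items()}


if __name__ == "__main__":
    shares = relative_to_leader(RESNET50_H100)
    # Print the ranking, highest first, with each share of the top result.
    for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<26}{RESNET50_H100[name]:>6}{share:>8.0%}")
```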

The table below shows the results for a different workload, the image segmentation model ‘unet3d’. This workload is extremely throughput-heavy and latency-sensitive, and is simulated using Nvidia A100 accelerators. Again, the results demonstrate that NUS is a leader in both absolute performance and linear performance scaling when serving AI/ML workloads.

Name	Workload	Accelerator type	Accelerators achieved
IEI	unet3d	a100	264
Nutanix Unified Storage	unet3d	a100	195
DDN	unet3d	a100	72
Hammerspace	unet3d	a100	35
Lightbits-Labs	unet3d	a100	24
Weka	unet3d	a100	24
Mangoboost	unet3d	a100	21
HPE	unet3d	a100	9

What is the MLPerf benchmark and how does it work
Simply put, the MLPerf Storage v1.0 benchmark measures how fast storage systems can supply training data while a model is being trained. It uses ‘simulated GPUs’, modelled on some of the most powerful GPUs on the market, to push a storage system to its limits, providing realistic results on what customers can expect.

It uses synthetic data and various tests, such as ‘resnet50’ for image classification (e.g., ‘Is this a duck or a house?’). A high-performance storage system provides rapid data access for the GPU servers, so that they are used efficiently. In essence, the higher the storage performance, the greater the accelerator utilisation and efficiency.
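
The link between storage speed and accelerator utilisation can be sketched in a few lines of Python. This is a simplified model, not the benchmark's own code: it assumes utilisation is the share of wall-clock time an accelerator spends computing rather than waiting for data. The function names, the 90% pass threshold and the timings in the usage note are illustrative assumptions.

```python
def accelerator_utilisation(compute_seconds: float, data_wait_seconds: float) -> float:
    """Fraction of wall-clock time spent computing rather than waiting on storage."""
    total = compute_seconds + data_wait_seconds
    if total <= 0:
        raise ValueError("total time must be positive")
    return compute_seconds / total


def meets_threshold(utilisation: float, threshold: float = 0.90) -> bool:
    """Check a utilisation figure against an assumed pass threshold."""
    return utilisation >= threshold


if __name__ == "__main__":
    # Hypothetical timings: the same compute work with fast and slow storage.
    for label, wait in (("fast storage", 5.0), ("slow storage", 25.0)):
        au = accelerator_utilisation(compute_seconds=95.0, data_wait_seconds=wait)
        print(f"{label}: utilisation {au:.1%}, passes: {meets_threshold(au)}")
```

Under these assumptions, the same compute work passes the threshold when the accelerator waits 5 seconds for data and fails when it waits 25 seconds, which is why slower storage lowers the number of accelerators a system can keep busy.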

AI/ML: On-premises or in the cloud? Why not both?
AI training, tuning and inference are key examples of the different ways enterprise AI is changing the landscape. As AI/ML takes shape, where you deploy, operate, and adapt these critical enterprise AI activities is just as important as what you run. This is where the cloud comes into play, acting as an excellent option for AI training.

Public cloud and AI training
The latest version of the Nutanix Unified Storage platform is a performance juggernaut for unstructured data, making it easier to consume, secure and run workloads like AI/ML across both on-premises and public clouds for data ingestion, training, tuning, and inferencing.

AWS EC2 P5 instances, powered by the latest Nvidia H100 GPUs, claim to “accelerate your time to solution by up to 4x compared to previous-generation GPU-based EC2 instances, and reduce cost to train ML models by up to 40%.” Yet GPU power is only one component; faster GPUs demand storage to match, and NUS has proven itself a leader in keeping Nvidia H100 GPUs fully utilised for AI/ML training in the latest MLPerf Storage v1.0 benchmark.

Private cloud/on-premises, AI inferencing and tuning
In the private cloud and on-premises, Nutanix Unified Storage performs efficiently using Nutanix hyperconverged infrastructure (HCI), a combination of compute, AI accelerators, storage, and networking in a single, scalable software-defined stack. As mentioned, training can be expensive on-premises because it typically requires a large investment in hardware (like GPUs), large data sets and a hefty CAPEX (capital expenditure) budget. As a result, general foundation models built on a large open corpus are best trained in the elastic infrastructure of the public cloud.

However, there are exceptions for sensitive government, corporate or other private data that cannot go to the cloud and must be kept on-premises. In this case, training would be completed on-premises. In both cases, inferencing and tuning shine on-premises because of the need to keep data sovereign (under your control). Here, Nutanix HCI helps scale AI infrastructure in line with your growth, like building blocks: you don’t need to purchase all your infrastructure upfront, but can size for what you need today and grow tomorrow as needs change.

Together, the public cloud for training and private cloud/on-premises for inferencing and tuning make a strong duo. With NUS, you get fast, secure data storage that you control, anywhere – delivered via the same software-defined storage platform.

A new standard in AI/ML storage performance
So, how does the MLPerf Storage v1.0 benchmark factor into performance storage for AI?

The MLPerf Storage benchmark suite measures how fast storage systems can supply training data while a model is being trained, and is the ‘gold standard’ for evaluating AI/ML training storage systems. MLPerf Storage v1.0 uses ‘simulated GPUs’, modelled on some of the most powerful GPUs on the market, to push a storage system to its limits, providing realistic results on what customers can expect.

The MLPerf Storage v1.0 benchmark provides a submission system where vendors (like Nutanix) may voluntarily submit their verified results under a strict rule set. Distributed training is implemented so that all benchmark clients must share a single data namespace. This version of the benchmark creates a more realistic data access pattern by adding modern GPU simulations and additional workloads.

Bar chart showing a 2023 performance metric of on-premises results vs. a 2024 metric for on-premises and public cloud which are much improved.

The results above speak for themselves: with the ‘unet3d’ workload held constant, the latest version of NUS keeps up with faster accelerators and higher throughput demands. Compared to previous versions, NUS makes a large leap in MLPerf Storage v1.0 for both on-premises and public cloud.

What we learned and recommend
Training a large model (e.g. a foundation model) requires enormous amounts of data and resources and should be completed quickly. But for most customers, training a model is done infrequently. For a good primer on the different parts of implementing an AI solution, see this Nutanix Validated Design.

Training an optimised, high-performing model is challenging. AI is the most transformative technology for enterprises in recent times, with rapid adoption and a focus on quick business outcomes.

However, this shift has exposed gaps in skills, processes and architectures. Here are some recommendations to get you started on your AI journey:

Use the public cloud for faster training, unless on-premises is needed for data sovereignty or fine-tuning

Shared storage works best for large datasets during training

Parallel GPU usage reduces training time

A fast, reliable network is essential for optimal performance
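
The recommendations above come down to simple arithmetic: every accelerator added in parallel raises the read throughput that shared storage and the network must sustain. The sketch below is a back-of-the-envelope estimate, not a sizing tool. The function name and every number in it are hypothetical, and it assumes each accelerator consumes samples at a steady rate with no caching.

```python
def required_read_throughput_gb_s(
    accelerators: int,
    samples_per_second: float,
    sample_size_mb: float,
) -> float:
    """Aggregate read throughput (GB/s) needed so that no accelerator waits for data.

    Assumes a steady per-accelerator sample rate, no caching, and 1 GB = 1000 MB.
    """
    if accelerators < 0 or samples_per_second < 0 or sample_size_mb < 0:
        raise ValueError("inputs must be non-negative")
    return accelerators * samples_per_second * sample_size_mb / 1000.0


if __name__ == "__main__":
    # Hypothetical workload: 20 samples/s per accelerator, 100 MB per sample.
    for count in (8, 16, 32):
        demand = required_read_throughput_gb_s(count, 20.0, 100.0)
        print(f"{count} accelerators need about {demand:.0f} GB/s of read throughput")
```

With these made-up figures, doubling the accelerator count doubles the throughput the storage layer has to deliver, which is why shared storage and network bandwidth need to scale together with the GPUs.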

Unlocking high performance: Key architectural features driving success
Nutanix Unified Storage has improved performance by 2x compared to last year’s benchmark results. In our previous blog, we covered the throughput-intensive and latency-sensitive ‘unet3d’ workload, used for image segmentation (for example, ‘this is a picture of an arm which also has fingers’, ‘a blue background’, ‘a medical coat’, etc.). Practical applications of this technology include medical imaging, autonomous vehicles, retail, media & entertainment, life sciences, and financial services. This year, with faster GPUs and larger datasets, performance demands increased, and Nutanix Unified Storage rose to the challenge.

Here’s how we doubled performance in just a single year:

iSER (iSCSI over RDMA) was leveraged for a faster storage network between Nutanix Files and the Nutanix storage controller.

NUS leveraged a performance profile, tuned for high performance and networking throughput.

Linear scalability was maintained whether deployed with 7 nodes, as we did for the on-premises submission, or with 32 nodes, as we did for the Files on Cloud (AWS) submission.

To learn more about these features, check out Nutanix Files Product Manager Marc Waldrop’s deep dive on his blog.

What’s next for Nutanix Unified Storage
Nutanix offers the Nutanix GPT-in-a-Box turn-key AI solution, which includes services and infrastructure to get started quickly with enterprise AI. The results of the Nutanix Unified Storage performance benchmark with MLPerf Storage v1.0 further solidify our position as an enterprise AI leader, supporting stronger business outcomes across the AI pipeline, including training an AI model (LLM).

Need a deeper look at MLCommons and the storage benchmark? Check out the MLPerf Storage v1.0 readme.

Article Source
https://itwire.com/guest-articles/company-news/nutanix-unified-storage-takes-the-lead-in-mlperf-storage-v1-0-benchmark.html
