Speeding up PyTorch inference using torch.compile on AWS Graviton processors

PyTorch 2.0 introduced torch.compile to accelerate PyTorch code over the default eager mode, and on AWS Graviton3 it delivers up to 2 times better performance for Hugging Face model inference and up to 1.35 times better performance for TorchBench model inference across a range of models. AWS optimized the torch.compile feature for Graviton3 processors to achieve these improvements. torch.compile optimizes the entire model into a single graph so it can run efficiently on the target hardware platform. These optimizations are available in the torch Python wheels and in the AWS Graviton PyTorch Deep Learning Containers (DLCs) starting with PyTorch 2.3.1.
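
The following is a minimal sketch of opting into torch.compile; it assumes torch >= 2.3.1 (the release that carries the Graviton optimizations) is installed, for example on a Graviton3-based instance. The model and shapes are illustrative only.

```python
import torch
import torch.nn as nn

print(torch.__version__)  # expect 2.3.1 or later for the Graviton optimizations

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# The default "inductor" backend captures the model as a single graph and
# lowers it to optimized kernels for the underlying hardware.
compiled_model = torch.compile(model, backend="inductor")

with torch.no_grad():
    out = compiled_model(torch.randn(8, 512))  # first call triggers compilation
print(out.shape)
```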

Eager mode in PyTorch executes each operator immediately as it is encountered, which incurs runtime overhead from redundant kernel launches and memory reads. In contrast, torch.compile mode synthesizes the operators into a graph, reducing memory reads and kernel launch overhead. The goal of the AWS Graviton team was to optimize the torch.compile backend for Graviton3 processors by reusing Arm Compute Library (ACL) kernels and oneDNN primitives.
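
A rough sketch of how that difference can be observed is shown below: the same model is timed in eager mode and after compilation. The model, batch size, and iteration counts are placeholders, not the blog's benchmark configuration.

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).eval()
x = torch.randn(16, 1024)

def bench(fn, iters=50, warmup=10):
    """Average per-iteration latency; warmup also absorbs compilation time."""
    with torch.no_grad():
        for _ in range(warmup):
            fn(x)
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
        return (time.perf_counter() - start) / iters

eager_latency = bench(model)
compiled_latency = bench(torch.compile(model))
print(f"eager: {eager_latency * 1e3:.2f} ms, compiled: {compiled_latency * 1e3:.2f} ms")
```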

The optimizations extend the TorchInductor backend and oneDNN primitives to improve compile mode performance on Graviton3 processors. Various NLP, computer vision (CV), and recommendation models were tested to demonstrate the improvements, with results showing a 1.35 times latency improvement for TorchBench models and a 2 times improvement for Hugging Face models on AWS Graviton3. Benchmarking scripts from the Hugging Face and TorchBench repositories were used to collect and compare the performance metrics.
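
As a hedged illustration of what such a measurement looks like for a Hugging Face model, the sketch below compiles a BERT model and times repeated inference. The model name (bert-base-uncased), batch size, and sequence length are placeholders rather than the exact configurations benchmarked in the blog post, and it assumes the transformers package is installed.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()
compiled_model = torch.compile(model)

inputs = tokenizer(["Graviton3 inference test"] * 8,
                   padding="max_length", max_length=128, return_tensors="pt")

with torch.no_grad():
    for _ in range(10):                  # warmup, including graph compilation
        compiled_model(**inputs)
    start = time.perf_counter()
    for _ in range(50):
        compiled_model(**inputs)
    print(f"avg latency: {(time.perf_counter() - start) / 50 * 1e3:.2f} ms")
```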

Instructions are provided for running inference in eager and torch.compile modes on AWS Graviton3 instances using the torch Python wheels and the benchmark scripts. Environment variables are set to improve torch.compile performance on Graviton3 processors. Sample output from the benchmark scripts and the PyTorch profiler illustrates the differences in latency and operator profiles between eager and torch.compile modes.
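
The sketch below shows the general shape of that setup: environment variables set before importing torch, followed by profiling a compiled model with torch.profiler. The specific variables shown (oneDNN bfloat16 fast math, transparent huge pages, and the weight LRU cache size) are ones commonly recommended for PyTorch on Graviton and are an assumption here; the blog post lists the exact settings it used.

```python
import os
# Assumed Graviton-oriented tunings; set before torch/oneDNN are initialized.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"  # bfloat16 fast math in oneDNN/ACL
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"         # transparent huge pages for tensor allocations
os.environ["LRU_CACHE_CAPACITY"] = "1024"        # cache reordered weights across iterations

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = torch.compile(nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).eval())
x = torch.randn(32, 1024)

with torch.no_grad():
    model(x)  # compile outside the profiled region
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        model(x)

# The operator table shows the compiled/fused kernels instead of many small eager ops.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```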

Future plans include extending the TorchInductor CPU backend to support compiling the Llama model and adding support for fused GEMM kernels. In conclusion, the torch.compile optimizations on AWS Graviton3 deliver substantial speedups for PyTorch model inference. Sunita Nadampalli, a Software Development Manager and AI/ML expert at AWS, leads the performance optimization efforts for AWS Graviton software.

Article Source
https://aws.amazon.com/blogs/machine-learning/accelerated-pytorch-inference-with-torch-compile-on-aws-graviton-processors/