Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute | NVIDIA Technical Blog

By Daniel Rodriguez
Publication Date: 2026-02-18 17:00:00

Python dominates machine learning for its ergonomics, but writing truly fast GPU code has historically meant dropping into C++ to write custom kernels and to maintain bindings back to Python. For most Python developers and researchers, this is a significant barrier to entry.

Frameworks like PyTorch address this by implementing kernels in CUDA C++, either handwritten or by leveraging libraries like the NVIDIA CUDA Core Compute Libraries (CCCL). Handwritten kernels are time-consuming and require deep, low-level architectural expertise. Using CUB, a C++ library within CCCL, is often the better option, since its primitives are highly optimized for each architecture and rigorously tested. But exposing CUB to Python has traditionally meant building and maintaining bindings and pre-instantiating C++ templates with fixed types and operators, which limits flexibility on the Python side.
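To see why fixed template instantiations are limiting, compare a binding that exposes only a few pre-compiled type/operator combinations with an interface that accepts arbitrary Python callables. The sketch below is a plain NumPy/Python analogue of that contrast; it is illustrative only and does not use the actual cuda.compute API (the function names `bound_reduce` and `pythonic_reduce` are hypothetical):

```python
import numpy as np
from functools import reduce

# A pre-instantiated C++ binding typically ships only fixed
# (dtype, operator) pairs chosen at build time:
PREBUILT = {("int32", "sum"), ("float32", "sum")}

def bound_reduce(arr, op_name):
    """Mimics a rigid binding: fails outside its compiled set."""
    if (arr.dtype.name, op_name) not in PREBUILT:
        raise TypeError(f"no instantiation for ({arr.dtype.name}, {op_name})")
    return arr.sum()

def pythonic_reduce(arr, op, init):
    """Mimics a Pythonic API: any dtype, any user-defined binary operator."""
    return reduce(op, arr.tolist(), init)

data = np.array([3, 1, 4, 1, 5], dtype=np.int64)
# bound_reduce(data, "sum") raises TypeError: int64 was never instantiated.
# The Pythonic interface takes the same data with a user-supplied operator:
print(pythonic_reduce(data, max, data[0]))  # prints 5
```

A Pythonic API like cuda.compute removes that build-time constraint: the type and operator are supplied from Python at call time rather than baked into a fixed set of C++ template instantiations.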

The NVIDIA cuda.compute library removes these limitations by offering a high-level, Pythonic API for device-wide CUB primitives.

Using cuda.compute helped an NVIDIA CCCL team top the GPU MODE leaderboard. GPU MODE is an online community of more than 20,000 members focused on learning and improving GPU programming, and it hosts kernel competitions to find the fastest implementations of a variety of tasks, from simple vector addition to more complex block matrix multiplications.

The NVIDIA CCCL team focuses on delivering “speed-of-light” (SOL) implementations of parallel…