Nvidia’s new open-weights Nemotron 3 Super combines three different architectures to beat gpt-oss and Qwen in throughput

By Carl Franzen
Publication Date: 2026-03-11 23:42:00

Multi-agent systems, designed to handle long-horizon tasks like software engineering or cybersecurity triaging, can generate up to 15 times the token volume of standard chats — threatening their cost-effectiveness in handling enterprise tasks.

But today, Nvidia sought to help solve this problem with the release of Nemotron 3 Super, a 120-billion-parameter hybrid model, with weights posted on Hugging Face.

By merging disparate architectural philosophies—state-space models, transformers, and a novel “Latent” mixture-of-experts design—Nvidia is attempting to provide the specialized depth required for agentic workflows without the bloat typical of dense reasoning models, all of it available for commercial use under a mostly open weights release.

Triple hybrid architecture

At the core of Nemotron 3 Super is a sophisticated architectural triad that balances memory efficiency with precision reasoning. The model utilizes a Hybrid Mamba-Transformer backbone, which interleaves Mamba-2 layers with strategic Transformer attention layers.

To understand the implications for enterprise production, consider the “needle in a haystack” problem. Mamba-2 layers act like a “fast-travel” highway system, handling the vast majority of sequence processing with linear-time complexity. This allows the model to maintain a massive 1-million-token context window without the memory footprint of the KV cache exploding. However, pure state-space models often struggle with associative recall. 
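The memory argument above can be sketched with some back-of-the-envelope arithmetic. The layer counts and dimensions below are illustrative assumptions, not Nemotron 3 Super’s actual configuration: attention layers must cache a key and value vector for every token seen, so their memory grows linearly with context length, while a Mamba-2 layer carries only a fixed-size recurrent state regardless of sequence length.

```python
# Hypothetical sketch: why interleaving Mamba-2 layers with a few attention
# layers shrinks the memory footprint at long context. All layer counts and
# dimensions here are illustrative assumptions, not the model's real config.

def kv_cache_bytes(num_attn_layers, seq_len, num_kv_heads=8,
                   head_dim=128, bytes_per_param=2):
    # Each attention layer stores one key and one value vector per token,
    # so the cache grows linearly with sequence length.
    return num_attn_layers * seq_len * num_kv_heads * head_dim * 2 * bytes_per_param

def mamba_state_bytes(num_mamba_layers, d_state=128, d_inner=8192,
                      bytes_per_param=2):
    # A Mamba-2 layer keeps a fixed-size recurrent state per layer,
    # independent of how many tokens have been processed.
    return num_mamba_layers * d_state * d_inner * bytes_per_param

seq_len = 1_000_000  # the 1-million-token context window discussed above

# All-attention stack vs. a hybrid that keeps only a handful of attention layers
dense = kv_cache_bytes(num_attn_layers=48, seq_len=seq_len)
hybrid = (kv_cache_bytes(num_attn_layers=6, seq_len=seq_len)
          + mamba_state_bytes(num_mamba_layers=42))

print(f"All-attention KV cache:      {dense / 1e9:.1f} GB")
print(f"Hybrid (6 attn + 42 Mamba):  {hybrid / 1e9:.1f} GB")
```

Under these assumed dimensions, the all-attention cache runs to hundreds of gigabytes at 1M tokens, while the hybrid stays in the tens, because the Mamba-2 state contributes a constant that is negligible next to the per-token cache of the remaining attention layers.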

To fix this,…