By Utkarsh Uppal
Publication Date: 2026-02-18 16:00:00
As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements. Running models with tens of billions of parameters in production, especially for conversational or voice-based AI agents, demands high throughput, low latency, and predictable service-level performance. For startups building sovereign AI models from scratch, these challenges are amplified by the need to balance model scale and accuracy with infrastructure efficiency—while also maintaining data sovereignty and cost control.
Sarvam AI, a generative AI startup based in Bengaluru, India, set out to build large, multilingual, multimodal foundation models that serve the country's diverse population, support nearly two dozen languages, and keep model development and data governance fully under India's sovereign control. To meet strict latency targets and improve inference efficiency for its flagship Sovereign 30B model, Sarvam AI collaborated with NVIDIA to co-design hardware and software optimizations.
This collaboration delivered a 4x end-to-end speedup in inference performance on the next-generation NVIDIA Blackwell architecture over baseline NVIDIA H100 GPUs. Kernel and scheduling optimizations on NVIDIA H100 SXM GPUs contributed a 2x speedup toward that gain. That was combined with the powerful…