NVIDIA Releases Polar, A Token-Faithful Rollout Framework For GRPO Training Across Codex, Claude Code, And Qwen Code

By Asif Razzaq
Publication Date: 2026-05-27 17:09:00

Reinforcement learning for language agents is growing more complex. Agents now manage multi-turn tool use, long-running contexts, and multi-agent orchestration. The main engineering challenge is connecting existing agent software to training pipelines without breaking how those tools work.

NVIDIA’s research team introduced Polar, a rollout framework that lets researchers run reinforcement learning over any agent harness without modifying that harness.

Table of Contents

The Core Problem Polar Solves

An ‘agent harness’ is a tool like Codex CLI, Claude Code, Qwen Code, or Pi. These harnesses manage system prompts, tool formatting, context engineering, and how the agent submits patches. These details directly affect agent behavior at evaluation time.

Traditional RL infrastructure requires harness logic to be rewritten behind a framework-owned environment API — typically env.init(), env.step(), env.reset() in the OpenAI Gym style. Every new harness requires new integration code. That integration can also lose execution details specific to the native harness path.

Polar’s key observation is that every LLM-based agent must call a model. That model API boundary is a common interface outside the agent itself. Instead of integrating inside the harness, Polar places a proxy at that boundary.

How the Proxy Works

For each incoming model request, the gateway proxy performs four steps:

Detect the provider API — using the request path and headers, it…

The Core Problem Polar Solves

How the Proxy Works

Related Posts