Headline: Phi-4-Vision-Reasoning—training a multimodal reasoning model

Headline: Phi-4-Vision-Reasoning—training a multimodal reasoning model

By Brenda Potts
Publication Date: 2026-03-04 18:05:00

At a glance

  • Phi-4-reasoning-vision-15B is a compact and smart open‑weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs. It is a broadly capable model that allows for natural interaction for a wide array of vision-language tasks and excels at math and science reasoning and understanding user-interfaces.
  • We share lessons learned and best practices for training a multimodal reasoning model—showing the benefit of careful architecture choices, rigorous data curation, and the benefits of using a mixture of reasoning and non-reasoning data.

We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry (opens in new tab), HuggingFace (opens in new tab) and GitHub (opens in new tab). Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as…