Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI | NVIDIA Technical Blog

Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI | NVIDIA Technical Blog

By Anu Srivastava
Publication Date: 2026-05-29 00:07:00

AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and language in real time—turning fragmented information into actionable insights.  

Step 3.7 Flash, the latest from StepFun, brings these capabilities to production and enterprise-scale, available on NVIDIA-accelerated infrastructure. It is a 198B-parameter Mixture-of-Experts vision-language model, with approximately 11B activated parameters per forward pass, optimized for agentic workflows that combine perception, search, and multi-step reasoning at production scale. 

With native image and video input, three configurable reasoning levels—low, medium, and high—and a 256k context window, it is designed for enterprise use cases such as financial analysis, concurrent coding agents, and other high-throughput multimodal use cases. Developers can use StepFun’s NVFP4-quantized checkpoint available through Hugging Face for boosted inference due to reduced memory bandwidth and storage requirements. 

Model Step 3.7 Flash 
Total parameters  198B 
Visual encoder parameters  1.8B 
Active parameters  11B 
Context length  256K 
Experts  288 (8 active) 
Table 1. Overview of the key Step 3.7 Flash specs, such as parameter counts, context length, and MoE configuration