Gemma 4 QAT Models: Optimizing Model Compression For Mobile And Laptop Efficiency

By Olivier Lacombe
Publication Date: 2026-06-05 00:00:00

Since releasing Gemma 4 two months ago, we have kept expanding its capabilities. First, we introduced Multi-Token Prediction (MTP) to speed up inference, and just a few days ago we released a 12B model to bridge the gap between our E4B and 26B MoE models.

Today we’re releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 even more efficient, allowing you to run models locally on everyday edge devices and consumer GPUs.

By simulating quantization during training, QAT minimizes the quality loss that comes from compressing the model. This release includes QAT checkpoints for the popular Q4_0 quantization format, as well as a novel quantization format specialized for mobile use cases. With this mobile format, we have reduced the memory requirement of Gemma 4 E2B to 1 GB. Together, these checkpoints significantly reduce storage requirements while maintaining the features and quality you expect from Gemma 4.
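To make "simulating quantization" concrete, here is a minimal sketch of the quantize-then-dequantize round trip that QAT-style training typically applies to weights in the forward pass. It is an illustration under stated assumptions, not the Gemma 4 training code and not the exact Q4_0 kernel: the function names are hypothetical, the rounding scheme is a simplified symmetric one, and the block layout (32 weights sharing one 16-bit scale) is assumed from the common Q4_0 convention.

```python
# Illustrative sketch only: a simplified symmetric 4-bit block quantizer.
# All names here are hypothetical and the rounding scheme is an assumption;
# this is neither the Gemma 4 training code nor the exact Q4_0 kernel.

BLOCK_SIZE = 32  # assumed weights per block, as in common Q4_0 layouts


def fake_quantize_block(block):
    """Round a block of weights to 4-bit integers, then map them back to floats."""
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / 7.0  # largest magnitude lands on integer level +/-7
    out = []
    for w in block:
        q = round(w / scale)       # nearest integer level
        q = max(-8, min(7, q))     # clamp to the signed 4-bit range
        out.append(q * scale)      # dequantize: this is what the model "sees"
    return out


def fake_quantize(weights, block_size=BLOCK_SIZE):
    """Apply the round trip block by block over a flat list of weights."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(fake_quantize_block(weights[start:start + block_size]))
    return out


def bits_per_weight(block_size=BLOCK_SIZE, weight_bits=4, scale_bits=16):
    """Average storage cost per weight, counting the per-block scale."""
    return (block_size * weight_bits + scale_bits) / block_size


weights = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
simulated = fake_quantize(weights)
worst_error = max(abs(a - b) for a, b in zip(weights, simulated))
print("worst-case rounding error:", worst_error)
print("bits per weight:", bits_per_weight())
```

Because the model trains against weights that have already been through this round trip, it can adapt to the rounding error before the checkpoint is actually compressed. Under the assumed layout, the per-block scale adds half a bit of overhead, so storage works out to 4.5 bits per weight rather than 4.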


Maintaining model quality while reducing size

