KV Cache Is Eating Your VRAM. Here’s How Google Fixed It With TurboQuant. | Towards Data Science

By Aman Vasisht
Publication Date: 2026-04-19 11:00:00

If you have spent any time with Transformers, you already know attention is the brain of the whole operation. It is what lets the model figure out which tokens are talking to each other, and that one mechanism is responsible for almost everything impressive LLMs do.

Attention works with three components: Query (Q), Key (K), and Value (V) [1]. The dot product between Q and K tells the model how much each token should focus on the others, and that scoring step is essentially the core of what attention does.
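To make that concrete, here is a minimal single-head sketch of scaled dot-product attention in NumPy. The toy shapes and random tensors are just for illustration, not from the article:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: scores = QK^T / sqrt(d_k), softmax, then weight V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # how much each query token attends to each key token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of value vectors

# Toy example: 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```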

Now, calling attention the “brain” also means it comes with a cost. During inference, every time a new token is predicted, the K and V matrices are recomputed for all the previous tokens as well. So if 90 tokens are already in the context and the model is predicting the 91st, it goes back and recomputes K and V for all 90. Isn’t this repetition a waste?
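A rough sketch of that wasteful path looks like this. The projection matrices `W_k` and `W_v` and the helper name are hypothetical, purely to show where the redundant work happens:

```python
import numpy as np

d_model, d_head = 16, 16
rng = np.random.default_rng(0)
W_k = rng.standard_normal((d_model, d_head))  # hypothetical key projection
W_v = rng.standard_normal((d_model, d_head))  # hypothetical value projection

def naive_decode_step(token_embeddings):
    """Without a KV cache: every step re-projects K and V for ALL tokens seen so far."""
    K = token_embeddings @ W_k   # recomputed from scratch, even for old tokens
    V = token_embeddings @ W_v
    return K, V

# After 90 tokens, predicting the 91st still re-projects all 90 previous rows.
context = rng.standard_normal((90, d_model))
K, V = naive_decode_step(context)
print(K.shape, V.shape)  # (90, 16) (90, 16)
```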

KV cache changed this. The idea is straightforward: instead of recomputing, just store the K and V matrices in VRAM and reuse them for every subsequent decoding step.
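Here is the same toy setup with a cache, assuming the same hypothetical projection matrices as above. Only the newest token gets projected; everything older is simply looked up:

```python
import numpy as np

d_model, d_head = 16, 16
rng = np.random.default_rng(0)
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))

K_cache = np.empty((0, d_head))  # grows by one row per generated token
V_cache = np.empty((0, d_head))

def cached_decode_step(new_token_embedding, K_cache, V_cache):
    """With a KV cache: project only the newest token, append, reuse the rest."""
    k_new = new_token_embedding @ W_k            # (1, d_head)
    v_new = new_token_embedding @ W_v
    K_cache = np.concatenate([K_cache, k_new], axis=0)
    V_cache = np.concatenate([V_cache, v_new], axis=0)
    return K_cache, V_cache

for _ in range(91):                              # simulate 91 decode steps
    x_new = rng.standard_normal((1, d_model))
    K_cache, V_cache = cached_decode_step(x_new, K_cache, V_cache)

print(K_cache.shape, V_cache.shape)  # (91, 16) (91, 16)
```

The trade is compute for memory: each decode step is now cheap, but the cache rows sit in VRAM for the entire generation, which is exactly why long contexts make the KV cache so expensive.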