We thank Greg Pereira and Robert Shaw from the llm-d team for their support in bringing llm-d to AWS.
In the agentic and reasoning era, large language models (LLMs) generate 10x more tokens through complex reasoning chains than they do for single-shot replies, with a corresponding jump in compute. Agentic AI workflows also create highly variable demand and multiply processing requirements yet again, slowing inference and degrading the user experience. As the world transitions from prototyping AI solutions to deploying AI at scale, efficient inference is becoming the gating factor.
LLM inference consists of two distinct phases: prefill and decode. The prefill phase is compute bound. It processes the entire input prompt in parallel to generate the initial set of key-value (KV) cache entries. The decode phase is memory bound. It autoregressively generates one token at a time while requiring substantial memory bandwidth to access model weights and the ever-growing KV cache.
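The asymmetry between the two phases can be sketched in a toy example (this is illustrative pseudocode, not llm-d's or any engine's actual implementation; the `kv_entry` projection and the "model" are hypothetical stand-ins for real attention computation):

```python
# Toy sketch contrasting the two LLM inference phases.
# Tokens are ints; "KV entries" are hypothetical (key, value) tuples.

def kv_entry(token):
    # Stand-in for the attention key/value projection of one token.
    return (token * 2, token * 3)

def prefill(prompt_tokens):
    """Compute-bound: all prompt tokens are processed in parallel
    (here, one list comprehension) to seed the KV cache."""
    return [kv_entry(t) for t in prompt_tokens]

def decode_step(kv_cache, model_step):
    """Memory-bound: each step must read the entire, ever-growing
    cache to produce one token, then appends one new KV entry."""
    next_token = model_step(kv_cache)      # touches every cached entry
    kv_cache.append(kv_entry(next_token))
    return next_token

# Hypothetical "model": next token = sum of cached keys, mod 100.
step = lambda cache: sum(k for k, _ in cache) % 100

cache = prefill([1, 2, 3, 4])             # one batched pass over the prompt
generated = [decode_step(cache, step) for _ in range(3)]  # one token per step
print(len(cache))  # 7: 4 prompt entries + 3 generated entries
```

Note that each decode step re-reads every cached entry while adding only one, which is why real decode throughput is limited by memory bandwidth rather than raw FLOPs.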