IBM Research Reveals Affordable AI Inferencing Using Speculative Decoding

IBM Research has made a breakthrough in AI inference by combining speculative decoding with paged attention to improve the cost-performance of large language models. This advance aims to boost the efficiency and profitability of customer service chatbots.

Large language models (LLMs) have in recent years improved chatbots' ability to understand customer inquiries and respond accurately. Nevertheless, the high cost and slow speed of serving these models have impeded wider AI adoption. Speculative decoding is an optimization technique that accelerates inference by producing tokens more quickly, potentially cutting latency by a factor of two to three for an improved customer experience.

However, reducing latency typically involves a trade-off: lower throughput, meaning fewer users can run the model simultaneously, which drives up operating costs. IBM Research has tackled this challenge by halving the latency of its open-source Granite 20B code model while quadrupling its throughput.

Speculative decoding improves token-generation efficiency in LLMs, which are built on the transformer architecture and ordinarily produce only one token per forward pass, an inefficient way to generate text. By modifying the process to propose several candidate tokens at once and then validate them in a single pass, speculative decoding can significantly speed up inference and make fuller use of each GPU.
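
As a rough illustration, here is a minimal sketch of the common draft-and-verify form of speculative decoding, assuming two Hugging Face-style models that return logits. The function name, the use of a separate small draft model, and the greedy acceptance rule are all assumptions for illustration; IBM's approach attaches a Medusa-style speculator to the base model rather than running a second model.

```python
import torch

def speculative_decode_step(target_model, draft_model, input_ids, k=4):
    """One greedy draft-and-verify step of speculative decoding.

    A small draft model proposes k tokens autoregressively; the larger
    target model then scores all k proposals in a single forward pass
    and accepts the longest prefix matching its own greedy choices.
    """
    prompt_len = input_ids.shape[1]

    # 1. Draft phase: the small model proposes k candidate tokens.
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)
    proposals = draft_ids[:, prompt_len:]  # the k drafted tokens

    # 2. Verify phase: one target forward pass scores every drafted position.
    target_logits = target_model(draft_ids).logits
    preds = target_logits[:, prompt_len - 1:-1, :].argmax(dim=-1)

    # 3. Accept the longest prefix on which draft and target agree.
    matches = (preds == proposals)[0].long()
    n_accept = int(matches.cumprod(dim=0).sum())

    # The target's own next token after the accepted prefix comes for free,
    # so even a fully rejected draft still advances generation by one token.
    bonus = target_logits[:, prompt_len - 1 + n_accept, :].argmax(
        dim=-1, keepdim=True)
    return torch.cat([input_ids, proposals[:, :n_accept], bonus], dim=-1)
```

The key property is that one target forward pass over the k drafted tokens replaces up to k sequential passes, and with this greedy acceptance rule the output matches what the target model would have produced on its own, so speed is gained without changing the result.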

IBM researchers have adapted the Medusa speculator by conditioning the speculated tokens on one another rather than predicting each independently, improving response speed. They have also adopted paged attention to optimize memory usage, inspired by the virtual-memory and paging concepts of operating systems: key-value sequences are divided into smaller blocks, reducing redundant computation and freeing up memory.
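
The paging idea can be sketched as a small block-table allocator in the spirit of paged attention; the block size, class name, and methods below are assumptions for illustration, not IBM's implementation.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (an assumed value; real systems vary)

@dataclass
class PagedKVCache:
    """Toy block-table allocator in the spirit of paged attention.

    Instead of reserving one contiguous KV buffer per sequence sized for the
    maximum possible length, each sequence maps its logical token positions
    to small fixed-size physical blocks allocated on demand, much like
    virtual-memory pages.
    """
    num_blocks: int
    free_blocks: list = field(default_factory=list)   # pool of unused block ids
    block_tables: dict = field(default_factory=dict)  # seq_id -> [block ids]
    seq_lens: dict = field(default_factory=dict)      # seq_id -> token count

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a (block id, offset) slot for the sequence's next token."""
        table = self.block_tables.setdefault(seq_id, [])
        pos = self.seq_lens.get(seq_id, 0)
        if pos % BLOCK_SIZE == 0:  # current block is full (or this is token 0)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return table[-1], pos % BLOCK_SIZE

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated on demand and returned to a shared pool when a sequence finishes, memory is no longer reserved for each request's maximum possible length, which is what allows more concurrent users on the same GPU.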

IBM has integrated speculative decoding and paged attention into its Granite 20B model, with plans to bring these techniques to its watsonx platform, benefiting enterprise AI applications. The company has also open-sourced the code on Hugging Face, enabling developers to adapt these optimization methods to their own LLMs.

This advancement by IBM Research signifies a major leap forward in AI inference technology, offering potential benefits for a wide range of applications beyond customer service chatbots. By improving cost performance and efficiency, these techniques have the potential to revolutionize the way AI models are deployed and utilized in various industries.

Article Source
https://blockchain.news/news/ibm-research-cost-effective-ai-inferencing-speculative-decoding