By Asif Razzaq
Publication Date: 2026-05-28 09:08:00
Perplexity AI’s research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden, their inference technology repository.
At production input lengths, the new encoder cuts p50 latency by roughly 5x versus the Hugging Face tokenizers crate, roughly 2x versus SentencePiece (C++), and roughly 1.5x versus IREE’s tokenizer (C), while performing zero steady-state heap allocations. Once deployed in Perplexity’s inference stack, it reduced CPU utilization by 5-6x and shaved double-digit milliseconds off reranker latency.
Why Tokenization Became a Bottleneck
LLM inference cost is typically framed around GPU work: KV caches, attention kernels, expert routing. But smaller models, such as embedding models, classifiers, and rerankers, tell a different story. These models are two to three orders of magnitude smaller than frontier transformers.
A reranker scoring hundreds of candidate documents per request is a clear example. With a small model, GPU compute often finishes in single-digit milliseconds, yet every input must still pass through CPU-side tokenization first. When batch sizes are large, tokenization becomes a meaningful fraction of total request latency.
Perplexity’s work targets XLM-RoBERTa, a model with a 250K-token Unigram vocabulary trained with SentencePiece. Fine-tuned RoBERTa-family encoders are a common production choice for ranking, retrieval, and similarity tasks.
What is Unigram Tokenization?
Unigram…
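As general background, a Unigram tokenizer assigns each vocabulary piece a log-probability and selects the segmentation whose pieces have the highest total score, typically with a Viterbi-style dynamic program. The sketch below illustrates that idea in Python. It is not Perplexity's Rust implementation: the function name `unigram_encode`, the `TOY_VOCAB` scores, and the fixed unknown-character penalty are all assumptions made for illustration. Real tokenizers such as SentencePiece also normalize text, mark whitespace, and use a trie rather than substring dictionary lookups.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability (values are made up).
TOY_VOCAB = {
    "uni": -2.0, "gram": -2.5, "un": -2.2, "ram": -3.0, "i": -3.0,
    "u": -4.0, "n": -4.0, "g": -4.0, "r": -4.0, "a": -4.0, "m": -4.0,
}


def unigram_encode(text, log_probs, unk_log_prob=-20.0):
    """Return the highest-scoring segmentation of `text` as a list of pieces."""
    n = len(text)
    max_len = max(map(len, log_probs), default=1)

    # best[i] is the best total score of any segmentation of text[:i];
    # back[i] is the start index of the last piece in that segmentation.
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = log_probs.get(text[start:end])
            if score is None:
                # Assumption: unknown single characters get a fixed penalty
                # so that every input has at least one valid segmentation.
                if end - start != 1:
                    continue
                score = unk_log_prob
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    pieces.reverse()
    return pieces


print(unigram_encode("unigram", TOY_VOCAB))  # ['uni', 'gram']
```

With these toy scores, "uni" + "gram" totals -4.5 and beats alternatives such as "un" + "i" + "gram" at -7.7. The inner loop runs once per (position, candidate length) pair, which is the kind of per-character work a production encoder has to make cheap when the vocabulary holds roughly 250K pieces.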

