Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate

By Asif Razzaq
Publication Date: 2026-05-28 09:08:00

Perplexity AI’s research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden, their inference technology repository.

At production input lengths, the new encoder cuts p50 (median) latency by roughly 5x versus the Hugging Face tokenizers crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE’s tokenizer (C), all with zero steady-state heap allocations. In production, it reduced CPU utilization in Perplexity’s inference stack by 5-6x and shaved double-digit milliseconds off reranker latency.

Why Tokenization Became a Bottleneck

LLM inference cost is typically framed around GPU work: KV caches, attention kernels, expert routing. But smaller models, such as embedding models, classifiers, and rerankers, tell a different story. These models are two to three orders of magnitude smaller than frontier transformers.

A reranker scoring hundreds of candidate documents per request is a clear example. With a small model, GPU compute often finishes in single-digit milliseconds, yet every input must still pass through CPU-side tokenization first. When batch sizes are large, tokenization becomes a meaningful fraction of total request latency.

Perplexity’s work targets XLM-RoBERTa, a model with a 250K-token Unigram vocabulary trained with SentencePiece. Fine-tuned RoBERTa-family encoders are a common production choice for ranking, retrieval, and similarity tasks.

What is Unigram Tokenization?

Unigram…
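
For readers who want a concrete picture, the general idea behind Unigram tokenization can be sketched in a few lines of Rust: every vocabulary piece carries a log-probability, and the encoder picks the segmentation with the highest total score using a Viterbi-style dynamic program. This is a minimal illustration under stated assumptions, not Perplexity's implementation. The function names `unigram_segment` and `toy_vocab`, the vocabulary, and its scores are all hypothetical, and the sketch skips normalization, unknown-token handling, and the performance work the article is about.

```rust
use std::collections::HashMap;

/// Returns the highest-scoring segmentation of `text` under a toy Unigram
/// vocabulary (piece -> log-probability), or None if no segmentation exists.
/// Hypothetical helper for illustration only.
fn unigram_segment(text: &str, vocab: &HashMap<&str, f64>) -> Option<Vec<String>> {
    // Byte offsets of every char boundary, including the end of the string.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let n = bounds.len() - 1; // number of chars in `text`

    // best[i]: best total log-probability for the first i chars.
    let mut best = vec![f64::NEG_INFINITY; n + 1];
    // back[i]: char index where the last piece of that best path starts.
    let mut back = vec![0usize; n + 1];
    best[0] = 0.0;

    for end in 1..=n {
        for start in 0..end {
            if best[start] == f64::NEG_INFINITY {
                continue; // the prefix ending at `start` is not reachable
            }
            let piece = &text[bounds[start]..bounds[end]];
            if let Some(&logp) = vocab.get(piece) {
                let score = best[start] + logp;
                if score > best[end] {
                    best[end] = score;
                    back[end] = start;
                }
            }
        }
    }

    if best[n] == f64::NEG_INFINITY {
        return None; // some part of the input is not covered by the vocabulary
    }

    // Follow the back pointers from the end to recover the pieces.
    let mut pieces = Vec::new();
    let mut end = n;
    while end > 0 {
        let start = back[end];
        pieces.push(text[bounds[start]..bounds[end]].to_string());
        end = start;
    }
    pieces.reverse();
    Some(pieces)
}

/// Made-up vocabulary and scores, chosen only to make the example readable.
fn toy_vocab() -> HashMap<&'static str, f64> {
    HashMap::from([
        ("h", -5.0), ("e", -5.0), ("l", -5.0), ("o", -5.0),
        ("he", -3.0), ("ll", -3.5), ("llo", -4.0),
        ("hell", -6.0), ("hello", -8.0),
    ])
}

fn main() {
    let vocab = toy_vocab();
    println!("{:?}", unigram_segment("hello", &vocab));
}
```

With these toy scores, "he" + "llo" totals -7.0, which beats the single piece "hello" at -8.0, so the two-piece segmentation wins. The nested loop examines every candidate substring, which is fine for a sketch but is the kind of per-input cost that matters at production batch sizes.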