Gemma 4 Gets Multi-Token Prediction Drafters: 3x Faster Inference Without Quality Loss

On 2026-05-05, Google released a companion line of small autoregressive drafter models for the Gemma 4 family, plus a Multi-Token Prediction (MTP) head. Available on Hugging Face and Kaggle. Supported in Transformers, MLX, vLLM, SGLang, and Ollama from day one. Apache 2.0.

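The headline claim: up to 3× decoding speedup with no quality degradation. As in standard speculative decoding, the target model still verifies every drafted token, so the drafter can only change how fast tokens arrive, not which tokens the target accepts. On Apple Silicon with mixture-of-experts variants and batch sizes 4–8, the speedup is around 2.2×.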

What's different here is that speculative decoding is designed into the model, not bolted on top of it. Through 2024–2025, speculative decoding was a generic add-on: pair any small model with any big model and hope they agree. Gemma 4's drafters are co-designed with the target.
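For readers new to the technique, here is a minimal sketch of one generic draft-and-verify loop under greedy decoding. It is not Gemma 4's implementation: the "models" are toy functions that map a token sequence to the next token, and every name here (speculative_step, generate, toy_target, toy_drafter) is hypothetical. A real target scores all drafted positions in one batched forward pass; the sketch calls it per position for readability.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function: token sequence -> next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(
    target: NextToken, drafter: NextToken, context: List[int], k: int
) -> Tuple[List[int], int]:
    """One draft-and-verify round. Returns (emitted tokens, accepted drafts)."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks each drafted token in order.
    emitted: List[int] = []
    for tok in draft:
        expected = target(context + emitted)
        if tok != expected:
            # First disagreement: keep the target's token, drop the rest.
            emitted.append(expected)
            return emitted, len(emitted) - 1
        emitted.append(tok)

    # 3. Every draft was accepted; the target's pass yields one bonus token.
    emitted.append(target(context + emitted))
    return emitted, k


def generate(
    target: NextToken, drafter: NextToken, prompt: List[int], n: int, k: int = 4
) -> List[int]:
    """Generate n tokens after the prompt using draft-and-verify rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        emitted, _ = speculative_step(target, drafter, out, k)
        out.extend(emitted)
    return out[: len(prompt) + n]


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain one-token-at-a-time decoding, for comparison."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out


# Toy stand-ins: the drafter agrees with the target except after token 5.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_drafter(seq: List[int]) -> int:
    return 0 if seq[-1] == 5 else toy_target(seq)
```

Calling generate(toy_target, toy_drafter, [1], 10) returns the same sequence as greedy_decode(toy_target, [1], 10), just in fewer target rounds. That equality is the "no quality loss" property in its simplest form. With sampling instead of greedy decoding, the accept/reject rule is probabilistic, but the idea is the same.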

Three architectural moves:

  • Target activation sharing. The drafter consumes the final-layer activations of the target Gemma 4 model on round 1, concatenated with its own embeddings. The prompt encoding the target paid for once is reused, not redone.
  • KV cache sharing. The drafter cross-attends to the target's KV cache instead of building its own. With long contexts becoming the dominant inference cost, this is the only way to keep the drafter small without losing context.
  • Efficient embedder. Gemma 4 has a 262K-token vocabulary. The drafter uses sparse decoding via clustered token lookup: identify the most likely token cluster, then compute logits only inside it. A classic two-stage retrieval applied to LM heads.
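The third move, clustered token lookup, can be sketched in a few lines. This is a toy illustration of generic two-stage retrieval over an output vocabulary, assuming clusters are represented by the mean of their members' embeddings. How Gemma 4 actually builds and scores its clusters is not described here, and all names (clustered_argmax, mean_vector, the 6-token vocabulary) are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean, used here as a cluster centroid."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def clustered_argmax(
    hidden: Vector,
    embeddings: List[Vector],
    clusters: List[List[int]],
    centroids: List[Vector],
) -> Tuple[int, int]:
    """Return (best token id, number of dot products evaluated)."""
    # Stage 1: score one centroid per cluster instead of every token.
    best_cluster = max(range(len(centroids)), key=lambda c: dot(hidden, centroids[c]))
    # Stage 2: compute logits only for tokens inside the winning cluster.
    candidates = clusters[best_cluster]
    best_token = max(candidates, key=lambda t: dot(hidden, embeddings[t]))
    return best_token, len(centroids) + len(candidates)


# Toy 6-token vocabulary in 2 dimensions, split into 2 clusters.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9], [0.3, 0.8]]
clusters = [[0, 1, 2], [3, 4, 5]]
centroids = [mean_vector([embeddings[t] for t in members]) for members in clusters]

token, cost = clustered_argmax([0.2, 1.0], embeddings, clusters, centroids)
```

In this toy case the lookup evaluates 5 dot products instead of 6. The saving grows with vocabulary size: the work scales with the number of clusters plus the size of one cluster, not with the full vocabulary. The lookup is approximate, since the true best token could sit in a cluster that lost stage 1. For a drafter that is acceptable, because the target verifies every drafted token and a miss only costs speed.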

For the deeper architectural breakdown, see AI Multi-Token Prediction Drafters.

Why it matters:

  1. It validates the "ship a drafter alongside the main model" pattern. Gemma 4 is the first major open-weight family to do this. Once one frontier lab ships drafters as a release artifact, the rest will follow. Your inference latency story is incomplete on consumer hardware without one.
  2. It rewards memory-bandwidth-bound regimes. Single-user, batch size 1, consumer GPU or Apple Silicon. The exact setup most people running local LLMs actually have. Datacenter GPUs already amortize memory transfers across batched users; consumer hardware doesn't, and that's where this lands.
  3. It's done without retraining the target. The drafters are independent artifacts. Existing Gemma 4 weights stay valid; you just add the drafter for the size you want to accelerate.
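Point 2 is easier to see with back-of-envelope arithmetic. The sketch below assumes decoding is purely memory-bandwidth-bound, so each target pass costs one full read of the weights. It ignores KV cache traffic, drafter cost, and compute. The function name and the numbers in the usage line are illustrative assumptions, not measurements of any Gemma 4 model or device.

```python
def decode_tokens_per_second(
    weight_bytes: float,
    bandwidth_bytes_per_s: float,
    tokens_per_pass: float = 1.0,
) -> float:
    """Rough ceiling on decode speed when every pass re-reads all weights."""
    # At batch size 1, one pass over the weights yields one token
    # unless several drafted tokens are verified in that same pass.
    passes_per_second = bandwidth_bytes_per_s / weight_bytes
    return passes_per_second * tokens_per_pass


# Illustrative only: 16 GB of weights on a device with 400 GB/s of bandwidth.
baseline = decode_tokens_per_second(16e9, 400e9)
with_drafter = decode_tokens_per_second(16e9, 400e9, tokens_per_pass=3.0)
```

Under these assumptions the ceiling moves from 25 to 75 tokens per second, because verifying several drafted tokens reuses the same weight read. A batched datacenter deployment already gets that reuse across users, which is why the gain concentrates on single-user hardware.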

This is not the same as DeepSeek-V3-style MTP training objectives. Those change the training-time loss to predict multiple tokens at once, even if inference still emits one at a time. Co-designed drafters are an architectural artifact at inference time. Don't conflate them.

One real limitation: speedups depend on the drafter agreeing with the target. Predictable patterns (code, structured output, repeated phrases) hit the high end of the curve. Highly creative or long-tail outputs land closer to 1×. The regime you're in matters more than the headline 3×.
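How strongly the speedup depends on agreement can be estimated with the standard speculative-decoding formula. The sketch assumes each drafted token is accepted independently with probability alpha, that k tokens are drafted per round, and that one drafter step costs draft_cost target passes. These are simplifying assumptions, and the parameter names are mine, not anything from the Gemma 4 release.

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per round: (1 - alpha^(k+1)) / (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        # Every draft accepted, plus the target's bonus token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by the round's cost in target-pass units."""
    round_cost = 1.0 + k * draft_cost  # one target pass plus k drafter steps
    return expected_tokens_per_target_pass(alpha, k) / round_cost
```

With illustrative inputs alpha = 0.8, k = 4, and draft_cost = 0.05, the estimate is roughly 2.8×. Drop alpha to 0.3 and it falls to about 1.2×. At alpha = 0 the drafting overhead pushes it slightly below 1×, which is why the regime matters more than the headline number.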

If you run Gemma 4 locally via Ollama or MLX, add the drafters. They're the biggest practical inference change since the April 2nd release.

About Sébastien

I'm Sébastien Dubois, and I'm on a mission to help knowledge workers escape information overload. After 20+ years in IT and seeing too many brilliant minds drowning in digital chaos, I've decided to help people build systems that actually work. Through the Knowii Community, my courses, products & services and my Website/Newsletter, I share practical and battle-tested systems.

I write about Knowledge Work, Personal Knowledge Management, Note-taking, Lifelong Learning, Personal Organization, Productivity, and more. I also craft lovely digital products and tools.
