Inference & Serving #speculative-decoding #drafting

Speculative decoding without the speculation

Draft models work. They also fail in ways the original papers didn't surface. Here is a small bag of tricks for keeping acceptance rates high in real workloads.

Mira Holst

@mira · contributor

Aug 17, 2026 · 16 min read

Speculative decoding is one of those techniques that looks great on a benchmark and grows complicated in production. The draft model's acceptance rate (the fraction of speculatively generated tokens that the target model accepts) varies widely from workload to workload, and the tricks for keeping it high are not yet folklore.
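To make the definition concrete, here is a minimal sketch of how acceptance rate could be computed from per-step counts. The `SpecStep` record, the `acceptance_rate` helper, and the numbers are all assumptions for illustration; they are not taken from any particular serving stack or from our measurements.

```python
from dataclasses import dataclass


@dataclass
class SpecStep:
    """One speculative step (hypothetical record, not a library type)."""
    drafted: int   # tokens proposed by the draft model in this step
    accepted: int  # of those, tokens the target model accepted


def acceptance_rate(steps):
    """Fraction of drafted tokens that the target model accepted."""
    drafted = sum(s.drafted for s in steps)
    accepted = sum(s.accepted for s in steps)
    if drafted == 0:
        return 0.0  # nothing drafted yet; returning 0.0 is an arbitrary choice
    return accepted / drafted


# Made-up counts: two workloads, same draft length of 4 tokens per step.
workload_a = [SpecStep(4, 4), SpecStep(4, 3), SpecStep(4, 4)]
workload_b = [SpecStep(4, 1), SpecStep(4, 0), SpecStep(4, 2)]

print(f"workload A: {acceptance_rate(workload_a):.2f}")
print(f"workload B: {acceptance_rate(workload_b):.2f}")
```

The rate is aggregated over all drafted tokens rather than averaged per step, so steps with longer drafts weigh more. Whether that is the right aggregation depends on what you want the metric to predict.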

Here are the four tricks that have actually made a difference for us.

Cite as: Holst, Mira. "Speculative decoding without the speculation." mlsystems.dev, Aug 17, 2026.


