Speculative decoding is one of those techniques that looks great on a benchmark and grows complicated in production. The acceptance rate of the draft model — the fraction of speculatively-generated tokens accepted by the target model — is wildly workload-dependent, and the tricks for keeping it high are not yet folklore.
Here are the four that have actually made a difference for us.