Quantization-aware training, end-to-end

Post-training quantization (PTQ) works remarkably well at 8 bits. It works pretty well at 4 bits with care. At FP4, though, the gap between PTQ and quantization-aware training (QAT) has widened to the point where, for most production workloads, QAT is no longer optional.

This article describes the recipe we converged on after about six months of trying things that didn’t work.

The recipe

We start from a BF16 checkpoint, run a short calibration pass, then spend 1-3% of the original training compute on QAT, using straight-through estimators on the weight quantizers. The activation quantizers are learned.
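To make the weight path concrete, here is a minimal sketch of fake quantization with a straight-through estimator, written in plain Python so the forward and backward rules are explicit. It is illustrative only and not our training code: it assumes the E2M1 flavor of FP4 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), a single per-tensor scale, and absmax calibration. The names `calibrate_scale`, `fake_quant_forward`, and `fake_quant_backward` are made up for this example.

```python
# Sketch only: fake-quantize weights to an assumed FP4 (E2M1) grid with a
# straight-through estimator (STE). All names here are hypothetical.

# Magnitudes representable in E2M1 (assumed format; the text only says "FP4").
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = E2M1_MAGNITUDES[-1]


def calibrate_scale(weights):
    """Absmax calibration (an assumption): map the largest |w| onto 6.0."""
    absmax = max(abs(w) for w in weights)
    return absmax / E2M1_MAX if absmax > 0 else 1.0


def fake_quant_forward(weights, scale):
    """Scale, clip to the grid range, snap to the nearest grid value, rescale."""
    out = []
    for w in weights:
        x = max(-E2M1_MAX, min(E2M1_MAX, w / scale))
        # Nearest magnitude; exact ties resolve to the smaller magnitude.
        mag = min(E2M1_MAGNITUDES, key=lambda g: abs(g - abs(x)))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def fake_quant_backward(weights, scale, grad_out):
    """STE: treat rounding as the identity, and zero the gradient where the
    forward pass clipped the value."""
    return [g if abs(w / scale) <= E2M1_MAX else 0.0
            for w, g in zip(weights, grad_out)]


if __name__ == "__main__":
    w = [0.1, -0.7, 2.4, 12.0]
    print(fake_quant_forward(w, 1.0))
    print(fake_quant_backward(w, 1.0, [1.0, 1.0, 1.0, 1.0]))
```

With a fixed scale of 1.0, the last weight is outside the grid range, so it is clipped in the forward pass and receives no gradient in the backward pass. In a real framework the same effect usually comes from autograd with a detached rounding residual rather than a hand-written backward.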

The order matters more than we expected. Quantizing weights first and activations second gives systematically better recovery than the reverse.
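One way to express that ordering is a small schedule that decides, per step, which quantizers are active. This is a hedged sketch rather than our actual schedule: `QuantPhase`, `phase_at`, and the `act_start_frac` knob are hypothetical, and the default of 0.5 is an arbitrary placeholder, not a recommended value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantPhase:
    """Which fake quantizers are enabled at a given QAT step (hypothetical)."""
    quantize_weights: bool
    quantize_activations: bool


def phase_at(step, total_steps, act_start_frac=0.5, weights_first=True):
    """Return the active quantizers at `step`.

    One quantizer family is on from step 0; the other is switched on once
    `act_start_frac` of the QAT run has elapsed. `act_start_frac` is an
    assumed knob; 0.5 is a placeholder value.
    """
    if not 0.0 <= act_start_frac <= 1.0:
        raise ValueError("act_start_frac must be in [0, 1]")
    second_on = step >= int(act_start_frac * total_steps)
    if weights_first:
        return QuantPhase(quantize_weights=True, quantize_activations=second_on)
    # Reverse order, kept only so the two orderings can be compared.
    return QuantPhase(quantize_weights=second_on, quantize_activations=True)


if __name__ == "__main__":
    for step in (0, 499, 500, 999):
        print(step, phase_at(step, total_steps=1000))
```

Keeping the reverse ordering behind a flag makes it cheap to rerun the comparison on a new model instead of taking the ordering on faith.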