Throughput Calculator

Back-of-the-envelope tokens/sec for a given model, precision, and hardware. Memory-bound regime only; assumes batched serving with healthy KV-cache headroom.

Model size: 70B params
Batch size: 8
Sequence length: 4096 tokens
Precision: fp8
GPU: H100 SXM

Estimate
Tokens / sec / GPU: 383
Memory required: 155.9 GB
Fits on 1 GPU? needs sharding
Param memory (fp8): 70.0 GB
KV cache @ batch=8, seq=4096: 85.9 GB
HBM bandwidth (H100 SXM): 3350 GB/s
Compute @ fp8: 1979 TFLOPS
⚠ Estimate is memory-bound roofline only. Actual numbers depend on kernel quality, continuous batching, speculative decoding, and a dozen other things this tool doesn't model.

What it does

Back-of-the-envelope throughput estimation that’s accurate enough to inform real architecture decisions. Plug in a model size, precision, batch size, sequence length, and GPU. Get tokens/sec, KV-cache size, and a verdict on whether you’ll fit on one card.
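
The arithmetic behind the example panel can be sketched in a few lines of Python. This is a reconstruction, not the tool's actual source. The function names are hypothetical. The KV-cache shape (80 layers, hidden size 8192, no grouped-query attention, 2-byte cache entries) and the 80 GB card capacity are assumptions, chosen because they reproduce the 85.9 GB and "needs sharding" results. The throughput line also assumes that only weight bytes are streamed per decode step, which is what matches the 383 tokens/sec figure.

```python
# Sketch of a memory-bound estimate. All names are hypothetical; the
# architecture defaults are assumptions picked to match the example panel.

GB = 1e9  # decimal gigabytes, as used on GPU spec sheets


def param_bytes(n_params: float, bytes_per_param: float) -> float:
    """Bytes needed to hold the weights."""
    return n_params * bytes_per_param


def kv_cache_bytes(batch: int, seq_len: int, n_layers: int = 80,
                   kv_dim: int = 8192, bytes_per_elem: int = 2) -> int:
    """Bytes for keys and values: 2 tensors per layer per token.

    Assumed: 80 layers, kv_dim 8192 (no grouped-query attention),
    16-bit cache entries.
    """
    return 2 * n_layers * kv_dim * bytes_per_elem * batch * seq_len


def tokens_per_sec(batch: int, weight_bytes: float,
                   hbm_bytes_per_sec: float) -> float:
    """Memory-bound decode: each step streams the weights once and emits
    one token per sequence in the batch (KV-cache reads are ignored)."""
    return batch * hbm_bytes_per_sec / weight_bytes


weights = param_bytes(70e9, 1.0)            # 70B params at fp8 (1 byte each)
kv = kv_cache_bytes(batch=8, seq_len=4096)
total = weights + kv
gpu_capacity = 80 * GB                      # assumed H100 SXM capacity

print(f"Param memory: {weights / GB:.1f} GB")
print(f"KV cache:     {kv / GB:.1f} GB")
print(f"Total:        {total / GB:.1f} GB")
print(f"Tokens/sec:   {tokens_per_sec(8, weights, 3350 * GB):.0f}")
print("fits on 1 GPU" if total <= gpu_capacity else "needs sharding")
```

With these assumptions the script prints 70.0 GB, 85.9 GB, 155.9 GB, 383 tokens/sec, and "needs sharding", matching the panel. A model with grouped-query attention or an 8-bit KV cache would need different defaults.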

Why it’s useful

Most “how fast will this run” questions can be answered without standing up infrastructure. This tool encodes the memory-bound roofline that governs LLM inference, so you can compare options at the cost of one keystroke instead of one cluster-hour.
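
As an illustration of that kind of comparison, the sketch below sweeps precision for the 70B example on a 3350 GB/s card. It assumes decode throughput equals batch size times HBM bandwidth divided by weight bytes, the simplification that reproduces the 383 tokens/sec figure. The function name is hypothetical, and the int4 row is included only for illustration; it is not confirmed to be an option in the tool.

```python
# Sketch: compare precisions under a memory-bound roofline.
# Assumes one full read of the weights per decode step and nothing else.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def roofline_tokens_per_sec(n_params: float, precision: str,
                            batch: int, hbm_gb_per_sec: float) -> float:
    """Tokens/sec/GPU when weight reads are the only bottleneck."""
    weight_gb = n_params * BYTES_PER_PARAM[precision] / 1e9
    steps_per_sec = hbm_gb_per_sec / weight_gb  # one weight pass per step
    return batch * steps_per_sec


results = {
    p: roofline_tokens_per_sec(70e9, p, batch=8, hbm_gb_per_sec=3350)
    for p in BYTES_PER_PARAM
}
for precision, tps in results.items():
    print(f"{precision:>5}: {tps:6.0f} tokens/sec/GPU")
```

Under this model, halving the bytes per parameter doubles the estimate: roughly 191, 383, and 766 tokens/sec for fp16, fp8, and int4 respectively.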

Limitations

  • Memory-bound regime only. Doesn’t model the compute-bound prefill ceiling.
  • Doesn’t account for continuous batching, speculative decoding, or prefix sharing, which are explicit knobs in real serving stacks.
  • Use the result as a rule of thumb, not a procurement quote.
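
To put the first limitation in numbers, one can estimate the batch size at which decode would stop being memory-bound on the example hardware. The sketch below assumes about 2 FLOPs per parameter per generated token for a dense decoder, and ignores attention FLOPs and KV-cache traffic. The function name is hypothetical.

```python
def crossover_batch(peak_flops: float, hbm_bytes_per_sec: float,
                    bytes_per_param: float,
                    flops_per_param: float = 2.0) -> float:
    """Batch size at which compute time per decode step equals weight-read time.

    time_memory  = n_params * bytes_per_param / hbm_bytes_per_sec
    time_compute = n_params * flops_per_param * batch / peak_flops
    n_params cancels, so the result depends only on hardware and precision.
    """
    return peak_flops * bytes_per_param / (flops_per_param * hbm_bytes_per_sec)


# Figures from the example panel: 1979 TFLOPS at fp8, 3350 GB/s HBM bandwidth.
b_star = crossover_batch(1979e12, 3350e9, bytes_per_param=1.0)
print(f"Compute-bound above roughly batch {b_star:.0f}")
```

Under these assumptions the crossover is near batch 295, so the batch-8 example sits well inside the memory-bound regime. Prefill processes many tokens per weight read and reaches the compute ceiling much sooner, which is why it falls outside this estimate.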