Throughput Calculator

Back-of-the-envelope tokens/sec for a given model, precision, and hardware. Memory-bound regime only; assumes batched serving with healthy KV-cache headroom.

Model size: 70B params

Batch size: 8

Sequence length: 4096 tokens

Precision: fp8

GPU: H100 SXM

Estimate

Tokens / sec / GPU: 383

Memory required: 155.9 GB

Fits on 1 GPU? No, needs sharding

Param memory (fp8): 70.0 GB

KV cache @ batch=8, seq=4096: 85.9 GB

HBM bandwidth (H100 SXM): 3350 GB/s

Compute @ fp8: 1979 TFLOPS

⚠ Estimate is memory-bound roofline only. Actual numbers depend on kernel quality, continuous batching, speculative decoding, and a dozen other things this tool doesn't model.

What it does

Back-of-the-envelope throughput estimation that’s accurate enough to inform real architecture decisions. Plug in a model size, precision, batch size, sequence length, and GPU. Get tokens/sec, KV-cache size, and a verdict on whether you’ll fit on one card.
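
The memory figures in the readout above can be reproduced with a few lines of arithmetic. The sketch below is one plausible reconstruction, not the tool's actual source. The layer count (80), hidden size (8192), fp16 KV cache without grouped-query attention, and 80 GB of HBM per H100 SXM are assumptions chosen because they match the displayed 85.9 GB and 155.9 GB; the function names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, matching the readout


def param_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight memory: one value per parameter at the chosen precision."""
    return n_params * bytes_per_param / GB


def kv_cache_gb(batch: int, seq_len: int, n_layers: int = 80,
                hidden: int = 8192, kv_bytes: int = 2) -> float:
    """KV cache: one key and one value vector per layer, per token, per sequence.

    n_layers, hidden and kv_bytes (fp16) are assumed values, not read from the tool.
    """
    per_token_bytes = 2 * n_layers * hidden * kv_bytes
    return batch * seq_len * per_token_bytes / GB


weights = param_memory_gb(70e9, 1)      # fp8 -> 1 byte per parameter
kv = kv_cache_gb(batch=8, seq_len=4096)
total = weights + kv
hbm_capacity_gb = 80                    # H100 SXM capacity (assumed input)
verdict = "fits" if total <= hbm_capacity_gb else "needs sharding"

print(f"{weights:.1f} GB weights + {kv:.1f} GB KV = {total:.1f} GB -> {verdict}")
# 70.0 GB weights + 85.9 GB KV = 155.9 GB -> needs sharding
```

Under these assumptions the KV cache is larger than the weights, so batch size and sequence length matter as much as precision for the single-card verdict.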

Why it’s useful

Most “how fast will this run” questions can be answered without standing up infrastructure. This tool encodes the memory-bound roofline that governs token-by-token LLM decoding, so you can compare options at the cost of one keystroke instead of one cluster-hour.
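
The roofline itself is short enough to write down. In the memory-bound regime, each decode step streams the weights from HBM once, and that single pass serves every sequence in the batch, so throughput scales as batch size times bandwidth over weight bytes. The sketch below assumes exactly that and ignores KV-cache reads. This simplification is inferred from the sample readout (it gives 383 for the values shown), and `decode_tokens_per_sec` is a hypothetical name, not the tool's API.

```python
def decode_tokens_per_sec(batch: int, hbm_gb_per_s: float, weight_gb: float) -> float:
    """Memory-bound estimate: one full weight read per decode step.

    Each step yields one token per sequence in the batch, so throughput is
    batch / (time to stream the weights once). KV-cache traffic is ignored,
    which is an assumption of this sketch.
    """
    step_seconds = weight_gb / hbm_gb_per_s
    return batch / step_seconds


# 70B params at fp8 (70 GB) on an H100 SXM (3350 GB/s), batch 8
estimate = decode_tokens_per_sec(batch=8, hbm_gb_per_s=3350, weight_gb=70.0)
print(round(estimate))  # 383
```

In this model, halving the weight bytes or doubling the batch doubles the estimate, which is why precision and batch size are the first knobs worth comparing.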

Limitations

Memory-bound regime only. Doesn’t model the compute-bound prefill ceiling.
Doesn’t account for continuous batching, speculative decoding, or prefix sharing — those are explicit knobs in real serving stacks.
Use the result as a rule of thumb, not a procurement quote.
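
For a rough sense of where the memory-bound assumption stops holding, the estimate can be compared against a compute ceiling. The sketch below uses the common approximation of 2 FLOPs per parameter per token and the peak fp8 throughput from the readout. Both are assumptions, real kernels land well below peak, and the function names are hypothetical.

```python
def compute_ceiling_tokens_per_sec(peak_tflops: float, n_params: float) -> float:
    """Upper bound if every FLOP were useful: ~2 FLOPs per parameter per token (assumed)."""
    return peak_tflops * 1e12 / (2 * n_params)


def crossover_batch(peak_tflops: float, hbm_gb_per_s: float, bytes_per_param: float) -> float:
    """Batch size where the memory-bound estimate meets the compute ceiling.

    Solves batch * bandwidth / weight_bytes == peak_flops / (2 * n_params);
    the parameter count cancels out.
    """
    return peak_tflops * 1e12 * bytes_per_param / (2 * hbm_gb_per_s * 1e9)


ceiling = compute_ceiling_tokens_per_sec(1979, 70e9)
batch_limit = crossover_batch(1979, 3350, bytes_per_param=1)  # fp8 on H100 SXM
print(f"compute ceiling ~{ceiling:,.0f} tokens/sec, crossover near batch {batch_limit:.0f}")
```

At batch 8 the sample configuration sits far below that crossover, which is consistent with treating decode as memory-bound.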