The arithmetic of attention: why FlashAttention still matters
Memory bandwidth, not FLOPs, is what bounds modern inference. A walk through the numbers behind a kernel that quietly reshaped the field.
An open community for learning about, writing about, and tinkering with the infrastructure behind modern AI: inference engines, training systems, ML stacks, and everything in between.
Three years after the original paper, what does state-of-the-art serving actually look like? A field report from a team running 12B tokens a day.
Top-k routing has become a default. It shouldn't be. A look at the tradeoffs nobody's measuring and the experiments that changed my mind.
Draft models work. They also fail in ways the original papers didn't surface. A small bag of tricks for keeping acceptance rates high in real workloads.
PagedAttention is a good idea poorly understood. A primer, plus the second-order effects you only see at 10,000 concurrent requests.
We've stopped treating embeddings like first-class data. A case for revisiting them, with measurements from a 200M-document corpus.
RAG is a feedback loop, an index policy, and a re-ranker pretending to be a system. Why most RAG postmortems mistake the symptom for the disease.
Long-form threads tied to GitHub identities. Best for technical questions, paper discussions, and feature proposals. Searchable, attributed, permanent.
Real-time chat for the working day. Quick questions, debugging help, paper club, and the occasional argument about whether MoE is overrated.
Inspect attention patterns layer-by-layer for any Hugging Face model. Click any head to see its causal mask, induction behavior, and sink tokens.
Estimate tokens/sec for any model, precision, batch size, and GPU combination — memory-bound roofline only, no kernel-quality wishful thinking.
Per-GPU VRAM breakdown for training and inference — params, gradients, optimizer state, activations, KV cache, with ZeRO/FSDP sharding.
Run a small set of evaluations against any inference endpoint and get back a structured scorecard — quality, latency, cost, and refusal rate side by side.
Generate a structured model card from a checkpoint and an evaluation log — covers intended use, training data summary, evals, limitations, and ethics.
Side-by-side timings for Triton, CUDA, and PyTorch implementations of the same kernel — attention, layernorm, GEMM, custom — across shapes and dtypes.