All articles.
The arithmetic of attention: why FlashAttention still matters
Memory bandwidth, not FLOPs, is what bounds modern inference. A walk through the numbers behind a kernel that quietly reshaped the field.
14 min
Continuous batching, revisited
Three years after the original paper, what does state-of-the-art serving actually look like? A field report from a team running 12B tokens a day.
19 min
What we've been getting wrong about MoE routing
Top-k routing has become a default. It shouldn't be. A look at the tradeoffs nobody's measuring and the experiments that changed my mind.
22 min
Quantization-aware training, end-to-end
FP4 is here, and the gap between PTQ and QAT has widened. What's actually working in production today, and why the recipe is messier than it looks.
17 min
A research-grade trainer in 400 lines
Most training frameworks are 50,000 lines of code in a trench coat. This is what falls out when you start from FSDP and a will to delete.
25 min
Embeddings as compression: the bitter lesson, retold
We've stopped treating embeddings like first-class data. A case for revisiting them, with measurements from a 200M-document corpus.
12 min
World models and the cost of imagination
Latent rollouts are cheap. World rollouts are not. What we learned trying to scale a JEPA-style world model on robotics data.
21 min
The hidden latency in agent loops
Every tool call is a round trip. Every round trip is a context append. Why naive agent loops compound latency faster than you expect, and what to do about it.
14 min
RAG is not retrieval-augmented generation
It is a feedback loop, an index policy, and a re-ranker pretending to be a system. Why most RAG postmortems mistake the symptom for the disease.
18 min
Speculative decoding without the speculation
Draft models work. They also fail in ways the original papers didn't surface. A small bag of tricks for keeping acceptance rates high in real workloads.
16 min
FSDP vs DeepSpeed, 2026 edition
The choice used to be obvious. It isn't anymore. A side-by-side on training a 30B model across three clusters and four hardware generations.
20 min
The four evals that matter (and the dozen that don't)
We have too many benchmarks and too few signals. A framework for choosing evaluations that correlate with the thing you actually care about.
13 min
Notes on KV cache paging at scale
PagedAttention is a good idea poorly understood. A primer, plus the second-order effects you only see at 10,000 concurrent requests.
15 min