The Archive

All articles.

All (14) Inference & Serving (4) Training Systems (1) Architecture (2) Distributed Training (1) Quantization (1) Retrieval & RAG (2) Models (1) Agents (1) Evaluation (1) MLOps & Deployment (0)

November 2026

The arithmetic of attention: why FlashAttention still matters

Inference & Serving #attention #kernels

Memory bandwidth, not FLOPs, is what bounds modern inference. A walk through the numbers behind a kernel that quietly reshaped the field.

Liam Chen Nov 12, 2026

Continuous batching, revisited

Inference & Serving #vLLM #serving

Three years after the original paper, what does state-of-the-art serving actually look like? A field report from a team running 12B tokens a day.

Priya Raghavan Nov 06, 2026

October 2026

What we've been getting wrong about MoE routing

Architecture #MoE #routing

Top-k routing has become a default. It shouldn't be. A look at the tradeoffs nobody's measuring and the experiments that changed my mind.

Hugo Belmar Oct 28, 2026

Quantization-aware training, end-to-end

Quantization #FP4 #QAT

FP4 is here, and the gap between PTQ and QAT has widened. What's actually working in production today, and why the recipe is messier than it looks.

Ana Voinescu Oct 19, 2026

A research-grade trainer in 400 lines

Training Systems #FSDP #PyTorch

Most training frameworks are 50,000 lines of code in a trench coat. This is what falls out when you start from FSDP and a will to delete.

Toma Iliescu Oct 09, 2026

September 2026

Embeddings as compression: the bitter lesson, retold

Retrieval & RAG #embeddings #retrieval

We've stopped treating embeddings like first-class data. A case for revisiting them, with measurements from a 200M-document corpus.

Naoko Ide Sep 22, 2026

World models and the cost of imagination

Models #JEPA #world-models

Latent rollouts are cheap. World rollouts are not. What we learned trying to scale a JEPA-style world model on robotics data.

Felix Marin Sep 14, 2026

The hidden latency in agent loops

Agents #agents #latency

Every tool call is a round trip. Every round trip is a context append. Why naive agent loops compound latency faster than you expect, and what to do about it.

Felix Marin Sep 12, 2026

August 2026

RAG is not retrieval-augmented generation

Retrieval & RAG #RAG #retrieval

It is a feedback loop, an index policy, and a re-ranker pretending to be a system. Why most RAG postmortems mistake the symptom for the disease.

Sasha Petrov Aug 28, 2026

Speculative decoding without the speculation

Inference & Serving #speculative-decoding #drafting

Draft models work. They also fail in ways the original papers didn't surface. A small bag of tricks for keeping acceptance rates high in real workloads.

Mira Holst Aug 17, 2026

FSDP vs DeepSpeed, 2026 edition

Distributed Training #FSDP #DeepSpeed

The choice used to be obvious. It isn't anymore. A side-by-side on training a 30B model across three clusters and four hardware generations.

Hugo Belmar Aug 04, 2026

July 2026

The four evals that matter (and the dozen that don't)

Evaluation #evals #benchmarks

We have too many benchmarks and too few signals. A framework for choosing evaluations that correlate with the thing you actually care about.

Liam Chen Jul 22, 2026

Neural Networks From Zero: From a Single Number to a Billion Parameters

Architecture #fundamentals #perceptron

A neural network never sees a word, an image, or a sound — only a list of numbers. Starting from that one fact and a single neuron, this day-zero guide builds the whole machine: how any input becomes numbers, why weights, biases, and activations each exist, how neurons stack into layers and layers into a model, and how to compute a model's size and running cost by hand.

Dinesh Jul 12, 2026

Notes on KV cache paging at scale

Inference & Serving #KV-cache #paging

PagedAttention is a good idea poorly understood. A primer, plus the second-order effects you only see at 10,000 concurrent requests.

Priya Raghavan Jul 11, 2026