ML Systems

An open archive for the engineers and researchers building modern machine learning systems.
https://mlsystems.dev/ (en-us)

The arithmetic of attention: why FlashAttention still matters
https://mlsystems.dev/blog/arithmetic-of-attention/
Memory bandwidth, not FLOPs, is what bounds modern inference. A walk through the numbers behind a kernel that quietly reshaped the field.
Thu, 12 Nov 2026, Liam Chen
Category: Inference; tags: attention, kernels, memory-bound

Continuous batching, revisited
https://mlsystems.dev/blog/continuous-batching-revisited/
Three years after the original paper, what does state-of-the-art serving actually look like? A field report from a team running 12B tokens a day.
Fri, 06 Nov 2026, Priya Raghavan
Category: Inference; tags: vLLM, serving, throughput

What we've been getting wrong about MoE routing
https://mlsystems.dev/blog/moe-routing/
Top-k routing has become a default. It shouldn't be. A look at the tradeoffs nobody's measuring and the experiments that change my mind.
Wed, 28 Oct 2026, Hugo Belmar
Category: Architecture; tags: MoE, routing, load-balancing

Quantization-aware training, end-to-end
https://mlsystems.dev/blog/qat-2026/
FP4 is here, and the gap between PTQ and QAT has widened. What's actually working in production today, and why the recipe is messier than it looks.
Mon, 19 Oct 2026, Ana Voinescu
Category: Quantization; tags: FP4, QAT, PTQ

A research-grade trainer in 400 lines
https://mlsystems.dev/blog/research-grade-trainer/
Most training frameworks are 50,000 lines of code in a trench coat. This is what falls out when you start from FSDP and a will to delete.
Fri, 09 Oct 2026, Toma Iliescu
Category: Training; tags: FSDP, PyTorch, trainer

Embeddings as compression: the bitter lesson, retold
https://mlsystems.dev/blog/embeddings-compression/
We've stopped treating embeddings like first-class data. A case for revisiting them, with measurements from a 200M-document corpus.
Tue, 22 Sep 2026, Naoko Ide
Category: RAG; tags: embeddings, retrieval

World models and the cost of imagination
https://mlsystems.dev/blog/world-models-cost/
Latent rollouts are cheap. World rollouts are not. What we learned trying to scale a JEPA-style world model on robotics data.
Mon, 14 Sep 2026, Felix Marin
Category: Models; tags: JEPA, world-models, agents

The hidden latency in agent loops
https://mlsystems.dev/blog/agent-loop-latency/
Every tool call is a round trip. Every round trip is a context append. Why naive agent loops compound latency faster than you expect, and what to do about it.
Sat, 12 Sep 2026, Felix Marin
Category: Agents; tags: agents, latency, tool-use, inference

RAG is not retrieval-augmented generation
https://mlsystems.dev/blog/rag-not-rag/
It is a feedback loop, an index policy, and a re-ranker pretending to be a system. Why most RAG postmortems mistake the symptom for the disease.
Fri, 28 Aug 2026, Sasha Petrov
Category: RAG; tags: RAG, retrieval, systems

Speculative decoding without the speculation
https://mlsystems.dev/blog/speculative-decoding/
Draft models work. They also fail in ways the original papers didn't surface. A small bag of tricks for keeping acceptance rates high in real workloads.
Mon, 17 Aug 2026, Mira Holst
Category: Inference; tags: speculative-decoding, drafting

FSDP vs DeepSpeed, 2026 edition
https://mlsystems.dev/blog/fsdp-vs-deepspeed/
The choice used to be obvious. It isn't anymore. A side-by-side on training a 30B model across three clusters and four hardware generations.
Tue, 04 Aug 2026, Hugo Belmar
Category: Distributed; tags: FSDP, DeepSpeed, training

The four evals that matter (and the dozen that don't)
https://mlsystems.dev/blog/evals-that-matter/
We have too many benchmarks and too few signals. A framework for choosing evaluations that correlate with the thing you actually care about.
Wed, 22 Jul 2026, Liam Chen
Category: Evals; tags: evals, benchmarks

Notes on KV cache paging at scale
https://mlsystems.dev/blog/kv-cache-paging/
PagedAttention is a good idea poorly understood. A primer, plus the second-order effects you only see at 10,000 concurrent requests.
Sat, 11 Jul 2026, Priya Raghavan
Category: Inference; tags: KV-cache, paging, vLLM