<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>ML Systems</title><description>An open archive for the engineers and researchers building modern machine learning systems.</description><link>https://mlsystems.dev/</link><language>en-us</language><item><title>The arithmetic of attention: why FlashAttention still matters</title><link>https://mlsystems.dev/blog/arithmetic-of-attention/</link><guid isPermaLink="true">https://mlsystems.dev/blog/arithmetic-of-attention/</guid><description>Memory bandwidth, not FLOPs, is what bounds modern inference. A walk through the numbers behind a kernel that quietly reshaped the field.</description><pubDate>Thu, 12 Nov 2026 00:00:00 GMT</pubDate><author>Liam Chen</author><category>Inference</category><category>attention</category><category>kernels</category><category>memory-bound</category></item><item><title>Continuous batching, revisited</title><link>https://mlsystems.dev/blog/continuous-batching-revisited/</link><guid isPermaLink="true">https://mlsystems.dev/blog/continuous-batching-revisited/</guid><description>Three years after the original paper, what does state-of-the-art serving actually look like? A field report from a team running 12B tokens a day.</description><pubDate>Fri, 06 Nov 2026 00:00:00 GMT</pubDate><author>Priya Raghavan</author><category>Inference</category><category>vLLM</category><category>serving</category><category>throughput</category></item><item><title>What we&apos;ve been getting wrong about MoE routing</title><link>https://mlsystems.dev/blog/moe-routing/</link><guid isPermaLink="true">https://mlsystems.dev/blog/moe-routing/</guid><description>Top-k routing has become a default. It shouldn&apos;t be. 
A look at the tradeoffs nobody&apos;s measuring and the experiments that changed my mind.</description><pubDate>Wed, 28 Oct 2026 00:00:00 GMT</pubDate><author>Hugo Belmar</author><category>Architecture</category><category>MoE</category><category>routing</category><category>load-balancing</category></item><item><title>Quantization-aware training, end-to-end</title><link>https://mlsystems.dev/blog/qat-2026/</link><guid isPermaLink="true">https://mlsystems.dev/blog/qat-2026/</guid><description>FP4 is here, and the gap between PTQ and QAT has widened. What&apos;s actually working in production today, and why the recipe is messier than it looks.</description><pubDate>Mon, 19 Oct 2026 00:00:00 GMT</pubDate><author>Ana Voinescu</author><category>Quantization</category><category>FP4</category><category>QAT</category><category>PTQ</category></item><item><title>A research-grade trainer in 400 lines</title><link>https://mlsystems.dev/blog/research-grade-trainer/</link><guid isPermaLink="true">https://mlsystems.dev/blog/research-grade-trainer/</guid><description>Most training frameworks are 50,000 lines of code in a trench coat. This is what falls out when you start from FSDP and a will to delete.</description><pubDate>Fri, 09 Oct 2026 00:00:00 GMT</pubDate><author>Toma Iliescu</author><category>Training</category><category>FSDP</category><category>PyTorch</category><category>trainer</category></item><item><title>Embeddings as compression: the bitter lesson, retold</title><link>https://mlsystems.dev/blog/embeddings-compression/</link><guid isPermaLink="true">https://mlsystems.dev/blog/embeddings-compression/</guid><description>We&apos;ve stopped treating embeddings like first-class data. 
A case for revisiting them, with measurements from a 200M-document corpus.</description><pubDate>Tue, 22 Sep 2026 00:00:00 GMT</pubDate><author>Naoko Ide</author><category>RAG</category><category>embeddings</category><category>retrieval</category></item><item><title>World models and the cost of imagination</title><link>https://mlsystems.dev/blog/world-models-cost/</link><guid isPermaLink="true">https://mlsystems.dev/blog/world-models-cost/</guid><description>Latent rollouts are cheap. World rollouts are not. What we learned trying to scale a JEPA-style world model on robotics data.</description><pubDate>Mon, 14 Sep 2026 00:00:00 GMT</pubDate><author>Felix Marin</author><category>Models</category><category>JEPA</category><category>world-models</category><category>agents</category></item><item><title>The hidden latency in agent loops</title><link>https://mlsystems.dev/blog/agent-loop-latency/</link><guid isPermaLink="true">https://mlsystems.dev/blog/agent-loop-latency/</guid><description>Every tool call is a round trip. Every round trip is a context append. Why naive agent loops compound latency faster than you expect, and what to do about it.</description><pubDate>Sat, 12 Sep 2026 00:00:00 GMT</pubDate><author>Felix Marin</author><category>Agents</category><category>agents</category><category>latency</category><category>tool-use</category><category>inference</category></item><item><title>RAG is not retrieval-augmented generation</title><link>https://mlsystems.dev/blog/rag-not-rag/</link><guid isPermaLink="true">https://mlsystems.dev/blog/rag-not-rag/</guid><description>It is a feedback loop, an index policy, and a re-ranker pretending to be a system. 
Why most RAG postmortems mistake the symptom for the disease.</description><pubDate>Fri, 28 Aug 2026 00:00:00 GMT</pubDate><author>Sasha Petrov</author><category>RAG</category><category>RAG</category><category>retrieval</category><category>systems</category></item><item><title>Speculative decoding without the speculation</title><link>https://mlsystems.dev/blog/speculative-decoding/</link><guid isPermaLink="true">https://mlsystems.dev/blog/speculative-decoding/</guid><description>Draft models work. They also fail in ways the original papers didn&apos;t surface. A small bag of tricks for keeping acceptance rates high in real workloads.</description><pubDate>Mon, 17 Aug 2026 00:00:00 GMT</pubDate><author>Mira Holst</author><category>Inference</category><category>speculative-decoding</category><category>drafting</category></item><item><title>FSDP vs DeepSpeed, 2026 edition</title><link>https://mlsystems.dev/blog/fsdp-vs-deepspeed/</link><guid isPermaLink="true">https://mlsystems.dev/blog/fsdp-vs-deepspeed/</guid><description>The choice used to be obvious. It isn&apos;t anymore. A side-by-side on training a 30B model across three clusters and four hardware generations.</description><pubDate>Tue, 04 Aug 2026 00:00:00 GMT</pubDate><author>Hugo Belmar</author><category>Distributed</category><category>FSDP</category><category>DeepSpeed</category><category>training</category></item><item><title>The four evals that matter (and the dozen that don&apos;t)</title><link>https://mlsystems.dev/blog/evals-that-matter/</link><guid isPermaLink="true">https://mlsystems.dev/blog/evals-that-matter/</guid><description>We have too many benchmarks and too few signals. 
A framework for choosing evaluations that correlate with the thing you actually care about.</description><pubDate>Wed, 22 Jul 2026 00:00:00 GMT</pubDate><author>Liam Chen</author><category>Evals</category><category>evals</category><category>benchmarks</category></item><item><title>Notes on KV cache paging at scale</title><link>https://mlsystems.dev/blog/kv-cache-paging/</link><guid isPermaLink="true">https://mlsystems.dev/blog/kv-cache-paging/</guid><description>PagedAttention is a good idea poorly understood. A primer, plus the second-order effects you only see at 10,000 concurrent requests.</description><pubDate>Sat, 11 Jul 2026 00:00:00 GMT</pubDate><author>Priya Raghavan</author><category>Inference</category><category>KV-cache</category><category>paging</category><category>vLLM</category></item></channel></rss>