An agent loop looks simple on paper: observe, think, act, repeat. In practice, each iteration carries hidden costs that compound across steps in ways that aren’t obvious until you’re debugging a five-minute response for a task that should have taken fifteen seconds.
This post breaks down where the latency actually goes in a typical ReAct-style agent, and what levers are worth pulling.
The naive model
Most agent frameworks model latency as:
total_latency ≈ n_steps × (llm_latency + tool_latency)
This is wrong in at least three ways.
The context growth problem
Every step appends to the context: the observation, the model’s reasoning trace, the tool result. By step 10, you may be feeding 8k tokens into a model that started with 500. Prefill time scales roughly linearly with context length, so your per-step LLM cost isn’t fixed — it grows.
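For a model with 50 ms of prefill time per 1k tokens, a 10-step loop that starts at 1k tokens and grows by 800 tokens per step processes 46k prompt tokens over the run, or about 2.3 s of prefill. A context fixed at 1k tokens would cost only 0.5 s, so growth alone adds roughly 1.8 s before any generation. This assumes the full context is re-processed on every step, with no prefix caching.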
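To check the arithmetic yourself, here is a minimal sketch under the same assumptions: prefill cost is linear in context length, and the whole context is re-processed on every step. The function names and the `ms_per_1k` parameter are made up for illustration.

```python
def prefill_ms(context_tokens: int, ms_per_1k: float = 50.0) -> float:
    """Prefill time for one step, assuming cost is linear in context length."""
    return context_tokens / 1000 * ms_per_1k


def run_prefill_ms(n_steps: int, start_tokens: int, growth_per_step: int,
                   ms_per_1k: float = 50.0) -> float:
    """Total prefill time when the full context is re-processed every step."""
    return sum(
        prefill_ms(start_tokens + step * growth_per_step, ms_per_1k)
        for step in range(n_steps)
    )


growing = run_prefill_ms(n_steps=10, start_tokens=1000, growth_per_step=800)
fixed = run_prefill_ms(n_steps=10, start_tokens=1000, growth_per_step=0)

print(f"growing context:   {growing:.0f} ms")          # 2300 ms
print(f"fixed context:     {fixed:.0f} ms")            # 500 ms
print(f"extra from growth: {growing - fixed:.0f} ms")  # 1800 ms
```

Swap in your own model's measured prefill rate and typical per-step growth. The extra cost grows quadratically with step count, so long runs are hit hardest.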
Tool call serialization
The standard loop calls tools sequentially. If step 3 needs results from two independent API calls — say, a database lookup and a web search — most frameworks serialize them:
tool_a → wait → tool_b → wait → continue
Parallelizing independent tool calls within a single step is one of the highest-leverage optimizations available: the step then waits for the slowest call rather than for the sum of all of them. Many frameworks still don’t do it by default.
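As a rough illustration, here is what the difference looks like with Python's `asyncio.gather`. The two tools (`db_lookup`, `web_search`) are hypothetical stand-ins that just sleep; a real agent would make network calls there.

```python
import asyncio
import time


async def db_lookup(query: str) -> str:
    # Hypothetical tool: stands in for a database call (~200 ms).
    await asyncio.sleep(0.2)
    return f"db:{query}"


async def web_search(query: str) -> str:
    # Hypothetical tool: stands in for a web search (~300 ms).
    await asyncio.sleep(0.3)
    return f"web:{query}"


async def run_serial(query: str) -> list[str]:
    # tool_a -> wait -> tool_b -> wait
    return [await db_lookup(query), await web_search(query)]


async def run_parallel(query: str) -> list[str]:
    # Both calls are in flight at once; gather returns results in argument order.
    return list(await asyncio.gather(db_lookup(query), web_search(query)))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


serial_result, serial_s = timed(run_serial("orders"))
parallel_result, parallel_s = timed(run_parallel("orders"))

print(f"serial:   {serial_s:.2f}s")    # roughly the sum of both sleeps
print(f"parallel: {parallel_s:.2f}s")  # roughly the longer sleep
```

This only helps when the calls really are independent. If the second tool needs the first tool's output, the dependency forces serialization no matter how the loop is written.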
The KV cache miss
Agents frequently branch: the model reasons, decides to call a tool, gets a result it didn’t expect, and reasons again from a slightly different context. If your inference backend uses prefix caching, these branches only partially hit. The prefix up to the branch point is served from cache, but every token after the divergence has to be prefilled again.
In practice this means speculative or branching agents see significantly lower cache hit rates than single-turn inference, and you should budget accordingly.
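A toy model makes the effect concrete. The sketch below treats a prefix cache as "reuse the longest shared run of leading tokens" and compares an append-only follow-up with a branched one. The token IDs are invented, and real backends typically cache in fixed-size blocks, so actual reuse may be somewhat lower than this idealized figure.

```python
def shared_prefix_len(cached: list[int], request: list[int]) -> int:
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


def cache_hit_rate(cached: list[int], request: list[int]) -> float:
    """Fraction of the request's tokens an idealized prefix cache could reuse."""
    if not request:
        return 0.0
    return shared_prefix_len(cached, request) / len(request)


# Toy token IDs: a 6-token shared history, then two different continuations.
history = [1, 2, 3, 4, 5, 6]
first_attempt = history + [10, 11, 12, 13]
retry_after_surprise = history + [20, 21, 22, 23, 24, 25]

# Append-only: the next request extends what is already cached.
linear = cache_hit_rate(first_attempt, first_attempt + [14, 15])
# Branched: the next request diverges right after the shared history.
branched = cache_hit_rate(first_attempt, retry_after_surprise)

print(f"append-only follow-up: {linear:.2f}")    # 0.83
print(f"branched follow-up:    {branched:.2f}")  # 0.50
```

Both requests are 12 tokens long, but the branched one reuses only the 6 tokens of shared history, so twice as many tokens go through prefill.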
What actually helps
Parallelize independent tool calls. If your agent framework supports it, batch tool calls within a step. The gains are immediate.
Cap context aggressively. Summarize or compress older turns rather than appending indefinitely. The quality hit is usually smaller than the latency hit from growing context.
Profile per-step, not end-to-end. Aggregate latency hides which steps are expensive. Log prefill tokens, generation tokens, and tool latency separately for each step.
Prefer streaming for interactive agents. Even if total latency is high, streaming the reasoning trace to the user buys perceived responsiveness.
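For per-step profiling, a small record per step is usually enough to start with. The sketch below is one possible shape, not a prescribed one: `StepTiming`, `RunProfile`, `fake_llm`, and `fake_tool` are all hypothetical names, and the two fake functions simply sleep to simulate work.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StepTiming:
    step: int
    prefill_tokens: int
    generation_tokens: int
    llm_s: float
    tool_s: float

    @property
    def total_s(self) -> float:
        return self.llm_s + self.tool_s


@dataclass
class RunProfile:
    steps: list[StepTiming] = field(default_factory=list)

    def slowest(self) -> StepTiming:
        return max(self.steps, key=lambda s: s.total_s)

    def report(self) -> str:
        lines = ["step  prefill  gen  llm_s  tool_s"]
        for s in self.steps:
            lines.append(
                f"{s.step:>4}  {s.prefill_tokens:>7}  {s.generation_tokens:>3}"
                f"  {s.llm_s:>5.3f}  {s.tool_s:>6.3f}"
            )
        return "\n".join(lines)


def timed(fn: Callable[..., Any], *args: Any) -> tuple[Any, float]:
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fake_llm(context_tokens: int) -> int:
    # Hypothetical model call: pretend cost grows with context length.
    time.sleep(context_tokens / 1_000_000)
    return 40  # tokens generated


def fake_tool(step: int) -> str:
    # Hypothetical tool call: step 2 hits a slow backend.
    time.sleep(0.05 if step == 2 else 0.001)
    return "ok"


profile = RunProfile()
context = 1000
for step in range(1, 4):
    generated, llm_s = timed(fake_llm, context)
    _, tool_s = timed(fake_tool, step)
    profile.steps.append(StepTiming(step, context, generated, llm_s, tool_s))
    context += 800  # context grows every step

print(profile.report())
print("slowest step:", profile.slowest().step)
```

An end-to-end timer would report one number for this run. The per-step table shows that the time went into a single tool call on step 2, which is a very different fix from shrinking the prompt.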
The agent loop is not inherently slow. But treating it as a black box and optimizing the LLM call in isolation will only get you so far.