Anatomy of AI Requests

Four patterns, drawn out

March 28, 2026AIAgentsRAGPatterns

Every AI feature shipping today is one of four shapes underneath. The model alone. The model plus retrieval. The model plus tools. The model in a loop. Each shape has its own latency profile, cost curve, and failure modes - and they compose. The diagrams below trace one full request through each, with the moving parts labelled.

1. Single LLM call

The smallest unit. Text in, tokens out. Underneath there is still a tokenizer, a context window assembled from system + user (+ optional prior assistant) messages, a prefill pass that warms the KV cache, and then autoregressive decoding that emits one token at a time over a streamed connection.

Latency anatomy: Two distinct numbers matter. Time-to-first-token (TTFT) is dominated by the prefill pass over input tokens. After that, throughput is roughly constant tokens-per-second during decode. Streaming hides the second number behind the first.
Cost asymmetry: Output tokens are typically 3-5x more expensive than input tokens, because decode is sequential and cannot batch across users like prefill can. Cap max_tokens aggressively or pay for completions you never read.
KV cache: Prefill writes one KV pair per input token. Repeated system prompts and shared prefixes can be cached server-side (prompt caching), dropping TTFT by 70-90% on cache hits. Cache keys are prefix-sensitive - reorder messages and you miss.
When this is enough: Classification, rewriting, summarization of in-context content, structured extraction, and anything the model already knows. If you find yourself stuffing knowledge into the system prompt to make this pattern work, you have outgrown it - move to RAG.

2. Retrieval-Augmented Generation

When the answer depends on knowledge the model was not trained on - your docs, last quarter's incidents, the new product spec - retrieval puts the relevant chunks into the prompt at query time. The query is embedded, scored against a vector index, optionally reranked by a cross-encoder, and the top-K chunks are stitched into context with citations.

Embedding model: A specialised encoder (often a small transformer) that maps text to a fixed-dim vector - 384, 768, 1536 are common. Same model must be used for both indexing and querying or scores are meaningless.
ANN index: Linear scan over a million vectors is too slow at query time. HNSW (graph-based) and IVF (inverted file) trade a small recall hit for sub-linear search. Tune ef_search / nprobe to balance latency and recall@K.
Reranking: Embedding similarity is fast but coarse - it sees query and chunk independently. A cross-encoder sees them jointly and scores how well the chunk actually answers the query. Adds 100-500ms but typically lifts precision@3 by 10-30%.
Failure modes: Chunks too small lose surrounding context, too large blow the token budget. Embedding mismatch (different model, different normalization) returns nearest-neighbour nonsense. Stale index = confidently wrong answers. Always show citations so users can verify.

3. Multi-tool calls

The model is given a set of tool schemas - name, description, JSON parameters - as part of its prompt. In one inference, it can decide to call zero, one, or many of them, with arguments. The runtime dispatches those calls (ideally in parallel), collects the results, and feeds them back into a second inference that produces the final answer.

Two LLM hops: Almost always at least two inferences: a planner that picks tools and a synthesizer that composes the answer from results. Some providers fold these into one streamed call with interleaved tool-call markers, but the cost structure is the same.
Parallel dispatch: The model returns multiple tool_calls in one turn. The runtime is responsible for fanning them out concurrently - sequential awaits will silently linearize and triple your latency. Wall-clock = max(tool durations) + 2x LLM time, not sum.
Schema cost: Tool definitions sit in every request until the conversation moves on. 10 tools at ~150 tokens each adds 1.5k input tokens per turn. Prompt caching makes this cheap on hit, expensive on miss. Curate the toolset; do not paste your whole API.
Error handling: A failing tool returns its error to the model rather than to the user. The model often retries with different arguments or apologises and proceeds without it. This is a feature, not a bug - but it means tool errors must be readable to the LLM, not just to humans.

4. Agent loop

Multi-tool, but the runtime loops. After each tool result, the model decides whether it has enough to answer or needs another step. Context grows with every iteration - prior reasoning, tool calls, tool outputs, accumulated observations. Watch the diagram below trace 3 iterations live: parallel tool dispatch in iter 2, a real context compaction event in the middle, and a sentinel done() that fires at the end. The journal underneath builds up as each event happens.

Per-iteration cost: Each loop is one LLM hop + one tool execution. With growing context, late iterations cost more than early ones - the prefill is over a longer prompt every time. A 10-step agent on a 32k context can spend most of its budget on iterations 7-10.
Context growth: The default behaviour is append-only: thought, action, observation, repeat. This works until you hit the context window. Production agents add summarization, eviction, or sub-agents to keep the working set bounded. Compaction is itself an LLM call.
Stop conditions: Explicit "I am done" tokens, a sentinel tool the model calls when finished, an iteration cap, a token-budget cap, or a wall-clock timeout. Belt-and-braces - any single one fails open eventually. Always log which condition fired so you can debug runaway loops.
Why it breaks: Failures appear iterations after the root cause. The model misreads a tool result in iteration 3, builds on that misread in 4 and 5, and produces nonsense in 6. Debugging means replaying the full trace, not just inspecting the final state - structure your logging accordingly.