Context engineering: assembling what your agent sees
Your agent is only as good as what it sees. A practical tour of context assembly: memory, retrieval, token budgets, and the pipeline that joins them.
Every turn, before your model generates a single token, something decides what the model gets to see. The system prompt, the conversation so far, retrieved memories, tool definitions, the user's latest message — all of it competes for a finite window, and the quality of that assembly puts a hard ceiling on the quality of everything downstream. That assembly discipline is context engineering, and most production agent failures we debug trace back to it, not to the model.
This post walks the whole pipeline: where context comes from, how retrieval decides what gets in, how token budgets force trade-offs, and what the assembly code actually looks like. It is also, deliberately, a showcase — it exercises every interactive primitive in our blog system, including the new scroll-driven walkthroughs, code walkthroughs, glossary terms, and comprehension checks. Expect a higher component density than a normal post; that is the point of this one.
Key takeaways
- Context is assembled, not accumulated. Every turn is a fresh selection from sources that always exceed the window.
- The pipeline has five stages — collect, select, compress, order, render — and each one is a place quality is won or lost.
- Retrieval is a policy decision, not a default: skipping retrieval is often the right call.
- Token budgets are engineering budgets. Allocate per source, enforce the allocation, and measure what each slice buys you.
- Attention dilution degrades agents long before the window is full — relevance beats volume.
Prompts are static; context is a system
"Prompt engineering" suggests a craft of wording. But in a production agent the system prompt is the only part of the input that stands still. Everything else (history, memories, retrieved documents, tool results) is selected at runtime, per turn, by code you wrote. The context window is a budgeted surface, and something in your stack is already doing context engineering. The only question is whether it does it deliberately or by accident.
Context engineering is the delicate art and science of filling the context window with just the right information for the next step.
The failure mode that makes this concrete: an agent that answers perfectly in turn three and confidently contradicts itself in turn thirty. The window was never full. What happened is attention dilution — every irrelevant transcript fragment competes with the relevant ones, and the model's effective precision drops as the noise grows. Relevance beats volume, every time.
The context assembly pipeline
Context assembly is a pipeline, and naming its stages gives you places to put instrumentation, budgets, and tests. The five stages are collect, select, compress, order, and render.
To see the pipeline run, follow one turn of a support agent as its context window gets built, in four states:
01
Everything competes
The collect stage produces more candidates than the window holds: forty turns of history, a dozen plausible memories, twenty retrieval hits, every tool definition. If you concatenate and truncate, the cut is arbitrary — whatever happened to be last loses.
02
Select what earns a slot
Selection applies explicit rules: the last N turns of history survive verbatim, memories must clear a similarity threshold, retrieval must be triggered by policy (more on that below). Everything else is rejected before compression — you can't summarize your way out of having selected the wrong things.
03
Compress the survivors
Old history collapses into a rolling summary; retrieval keeps top-K with the rest dropped, not truncated mid-sentence; verbose tool schemas get trimmed to the fields this workflow uses. Compression is lossy by design — the budget decides how lossy.
04
Order and render
Order is the cheapest optimization in the pipeline: stable blocks lead so prompt caching works; the live task lands at the end, where models attend most reliably. Render with explicit section markers — a model that can't tell a retrieved memory from a user instruction will eventually follow the wrong one.
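The rendering rule in that last step can be sketched in a few lines. This is a minimal illustration rather than any SDK's API: the `Block` shape, the `renderBlocks` name, and the XML-style tag format are all assumptions made for the example.

```typescript
// Hypothetical block shape: a named section plus a flag that marks recalled,
// fallible content (memories, retrieval hits) as distinct from instructions.
type Block = {
  name: string; // e.g. "system", "memory", "user"
  content: string;
  fallible?: boolean; // true for recalled or retrieved material
};

// Wrap each block in explicit markers so the model can tell sources apart.
// Empty blocks are skipped so they do not spend tokens on bare tags.
function renderBlocks(blocks: Block[]): string {
  return blocks
    .filter((b) => b.content.trim().length > 0)
    .map((b) => {
      const attr = b.fallible ? ' trust="recalled"' : "";
      return `<${b.name}${attr}>\n${b.content.trim()}\n</${b.name}>`;
    })
    .join("\n\n");
}

// Stable blocks lead, the live user message lands last.
const rendered = renderBlocks([
  { name: "system", content: "You are a support agent." },
  { name: "memory", content: "Prefers email over phone.", fallible: true },
  { name: "retrieval", content: "" },
  { name: "user", content: "Where is my order?" },
]);
```

Marking recalled material with an attribute is one possible convention; the point is only that a retrieved memory and a user instruction never share an unlabeled span.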
Where memories actually live
"Give the agent memory" is four different storage systems wearing one name. The shape that works in production separates them by lifetime and access pattern:
- Agent core → Session state (per-session)
- Agent core → Vector memory (long-lived)
- Agent core → Knowledge base (long-lived)
The distinction that matters most is write policy. Session state is written unconditionally; vector memory is written through a promotion gate ("is this worth remembering across sessions?"); the knowledge base is not written by the agent at all. Collapsing those three write policies into one store is how agents end up confidently recalling things that were never true.
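The three write policies can be kept apart in code by giving each store a different write path. The sketch below is illustrative only: the `Fact` shape, the `shouldPromote` rule (user confirmation or repeated mention), and the `MemoryStores` class are assumptions, not a description of any particular memory product.

```typescript
type Fact = { text: string; confirmedByUser: boolean; mentions: number };

// Hypothetical promotion rule: only facts the user confirmed, or that came up
// repeatedly in the session, are worth remembering across sessions.
function shouldPromote(fact: Fact, minMentions = 2): boolean {
  return fact.confirmedByUser || fact.mentions >= minMentions;
}

class MemoryStores {
  readonly sessionState: Fact[] = [];
  readonly vectorMemory: Fact[] = [];
  // The knowledge base is read-only from the agent's side: there is
  // deliberately no method on this class that writes to it.
  readonly knowledgeBase: readonly string[];

  constructor(knowledgeBase: readonly string[]) {
    this.knowledgeBase = knowledgeBase;
  }

  // Session state is written unconditionally.
  record(fact: Fact): void {
    this.sessionState.push(fact);
  }

  // Long-term memory is written only through the gate.
  // Returns the facts that were promoted.
  promoteAtSessionEnd(): Fact[] {
    const promoted = this.sessionState.filter((f) => shouldPromote(f));
    this.vectorMemory.push(...promoted);
    return promoted;
  }
}

const stores = new MemoryStores(["Refund window is 30 days."]);
stores.record({ text: "Vegetarian", confirmedByUser: true, mentions: 1 });
stores.record({ text: "Maybe visiting Osaka", confirmedByUser: false, mentions: 1 });
const promoted = stores.promoteAtSessionEnd();
```

Here the speculative "maybe visiting Osaka" stays in session state and never reaches long-term memory, which is the behavior the gate exists to provide.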
Retrieval is a ranking problem and a policy problem
Recall from vector memory works by embedding similarity: the query is embedded, candidate memories are ranked by distance, and the top-K survive. For one sample query, the ranking looks like this:
Top 3 retrieved
- Booked Tokyo flights in May (0.95)
- Based in Denver (0.94)
- Speaks basic Japanese (0.91)
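To make the ranking step concrete, here is a small self-contained sketch using cosine similarity over toy three-dimensional vectors. Real embeddings have hundreds or thousands of dimensions and come from an embedding model; the vectors, the `rankMemories` name, and the threshold used here are assumptions chosen so the arithmetic is easy to follow.

```typescript
type Memory = { text: string; embedding: number[] };
type Hit = { text: string; score: number };

// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every candidate, drop those under the threshold, keep the best topK.
function rankMemories(
  query: number[],
  memories: Memory[],
  topK: number,
  minSimilarity: number
): Hit[] {
  return memories
    .map((m) => ({ text: m.text, score: cosineSimilarity(query, m.embedding) }))
    .filter((h) => h.score >= minSimilarity)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy data: hand-written vectors standing in for real embeddings.
const memories: Memory[] = [
  { text: "Prefers window seats", embedding: [0, 0, 1] },
  { text: "Speaks basic Japanese", embedding: [0.7, 0.7, 0] },
  { text: "Booked Tokyo flights in May", embedding: [0.9, 0.1, 0] },
];
const hits = rankMemories([1, 0, 0], memories, 3, 0.5);
```

With these toy vectors the flight booking ranks first, the language note second, and the seat preference falls below the threshold, so asking for three results returns only two.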
Similarity ranks candidates, but it cannot decide whether to retrieve at all. That is a policy question, and encoding it explicitly beats retrieving on every turn. The choice between retrieving and answering from context turns on three inputs:
- Query type
- Whether the answer is already in recent context
- Optimization mode
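One way to encode that decision is a small pure function that returns both the verdict and the reason, so the choice can be logged and tested. The specific query types, the two optimization modes, and the rule order below are assumptions for illustration; a real policy would be tuned per workflow.

```typescript
// Hypothetical categories; a real classifier would define its own.
type QueryType = "chitchat" | "followup" | "factual" | "task";
type OptimizationMode = "latency" | "quality";

type PolicyInput = {
  queryType: QueryType;
  answerInRecentContext: boolean;
  mode: OptimizationMode;
};

type Decision = { retrieve: boolean; reason: string };

// Rules are checked in order; the first match wins.
function decideRetrieval(input: PolicyInput): Decision {
  if (input.queryType === "chitchat") {
    return { retrieve: false, reason: "chitchat needs no external facts" };
  }
  if (input.answerInRecentContext) {
    return { retrieve: false, reason: "answer is already in the window" };
  }
  if (input.queryType === "followup" && input.mode === "latency") {
    return { retrieve: false, reason: "follow-up under a latency budget" };
  }
  return { retrieve: true, reason: "no skip rule matched" };
}

const skip = decideRetrieval({
  queryType: "factual",
  answerInRecentContext: true,
  mode: "quality",
});
const fetch = decideRetrieval({
  queryType: "factual",
  answerInRecentContext: false,
  mode: "latency",
});
```

Returning the reason alongside the boolean keeps the skip path visible in traces, which matters when a wrong answer has to be traced back to a retrieval that never ran.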
The query side of this is a few lines. The discipline is in the threshold and the budget, not the API call:
```typescript
export async function recallMemories(
  query: string,
  budgetTokens: number
): Promise<MemoryChunk[]> {
  const embedded = await embed(query);
  const hits = await vectorStore.search(embedded, {
    topK: 8,
    minSimilarity: 0.75, // below this, "similar" is coincidence
  });
  return packToBudget(hits, budgetTokens); // whole chunks only, never split
}
```
The window is a budget: treat it like one
Allocate the input window per source, and enforce the allocation in code. A working budget for an 8K-input agent makes the cost of overruns visible: every token one source takes beyond its slice comes out of another source's share.
[Interactive chart: input budget for one turn (8K target)]
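A per-source budget and its enforcement might look like the sketch below. The split across sources is a made-up example that happens to sum to 8,000 tokens, the four-characters-per-token estimate is a rough stand-in for a real tokenizer, and this `packToBudget` is one plausible reading of "whole chunks only", not the post's actual helper.

```typescript
type Chunk = { text: string };

// Crude estimate (about 4 characters per token). An assumption for the sketch;
// a real system would use the tokenizer of the target model.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical allocation for an 8K input target. Illustrative numbers only.
const BUDGET: Record<string, number> = {
  system: 1000,
  tools: 1500,
  memory: 800,
  retrieval: 1700,
  summary: 600,
  history: 2000,
  user: 400,
};

// Keep whole chunks in ranked order and stop at the first one that would
// overrun the slice. Chunks are never split.
function packToBudget<T extends Chunk>(chunks: T[], budgetTokens: number): T[] {
  const kept: T[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > budgetTokens) break;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}

// List every source whose measured usage exceeds its slice.
function findOverruns(
  usage: Record<string, number>,
  budget: Record<string, number>
): string[] {
  return Object.keys(usage).filter((source) => usage[source] > (budget[source] ?? 0));
}

const totalBudget = Object.keys(BUDGET).reduce((sum, k) => sum + BUDGET[k], 0);
const packed = packToBudget(
  [{ text: "a".repeat(40) }, { text: "b".repeat(40) }, { text: "c".repeat(40) }],
  25
);
const overruns = findOverruns({ history: 2600, memory: 500 }, BUDGET);
```

Stopping at the first chunk that does not fit preserves the ranking; a variant that skips oversized chunks and keeps looking would pack more tightly at the cost of admitting lower-ranked material.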
When the budget forces compression, you have four levers, each with a different fidelity/cost profile:
| Strategy | Fidelity | Added cost | Use when |
|---|---|---|---|
| Truncation | Lowest | None | Prototypes only |
| Rolling summary | Good for narrative | One LLM call per N turns | Long sessions |
| Structured extraction | Exact for what it captures | Schema design + extraction call | Slot-heavy workflows |
| Relevance filtering | High, query-dependent | Embedding + scoring pass | Heavy windows, strict budgets |
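The slot-based lever (exact for what it captures, at the cost of designing a schema) can be sketched as a typed scratchpad plus a merge step. The `Scratchpad` fields below are a hypothetical schema for a support workflow, and the extraction call that would produce each update is left out; only the merge and render logic is shown.

```typescript
// Hypothetical schema; real slots depend on the workflow.
type Scratchpad = {
  orderId?: string;
  deliveryDate?: string; // ISO date, kept verbatim
  constraints: string[];
};

// Merge one extraction result into the scratchpad. Identifiers are copied
// exactly as extracted; a field missing from the update leaves the stored
// value alone. Constraints are appended without duplicates.
function mergeScratchpad(current: Scratchpad, update: Partial<Scratchpad>): Scratchpad {
  const merged = [...current.constraints, ...(update.constraints ?? [])];
  return {
    orderId: update.orderId ?? current.orderId,
    deliveryDate: update.deliveryDate ?? current.deliveryDate,
    constraints: merged.filter((c, i) => merged.indexOf(c) === i),
  };
}

// Render the slots as a compact block that travels next to the summary.
function renderScratchpad(pad: Scratchpad): string {
  const lines: string[] = [];
  if (pad.orderId) lines.push(`order_id: ${pad.orderId}`);
  if (pad.deliveryDate) lines.push(`delivery_date: ${pad.deliveryDate}`);
  for (const c of pad.constraints) lines.push(`constraint: ${c}`);
  return lines.join("\n");
}

let pad: Scratchpad = { constraints: [] };
pad = mergeScratchpad(pad, { orderId: "A-10492", constraints: ["under $200"] });
pad = mergeScratchpad(pad, {
  deliveryDate: "2026-05-12",
  constraints: ["under $200", "vegetarian"],
});
const padText = renderScratchpad(pad);
```

Because the identifier is stored as a string and only ever copied, no later summarization pass gets a chance to paraphrase it.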
What the reader feels: memory on vs. off
The difference context engineering makes is easiest to see from the user's side of the chat: same agent, same questions, with memory on and with memory off.
[Interactive demo: travel assistant, two sessions after booking]
Behind the memory-on script, one turn of the agent looks like this — note that recall is a tool call with a budget, not an ambient superpower:
Trace
"Can you find me a hotel for that trip?"
- Thought: "That trip" is unresolved. Check session summary, then long-term memory before asking the user.
- Tool call: memory.search (142 ms). Recall trip-related memories for this user.
  Input: { "query": "upcoming trip booking", "top_k": 3, "min_similarity": 0.75 }
  Output: { "hits": [ { "text": "Booked PDX→NRT flights, May 12–19", "score": 0.91 }, { "text": "Vegetarian", "score": 0.82 }, { "text": "Prefers window seats", "score": 0.78 } ] }
- Observation: Trip resolved: Tokyo, May 12–19. Dietary preference is relevant to hotel choice; seat preference is not. Include the first two, drop the third.
- Tool call: hotels.search (890 ms). Search hotels with the resolved constraints.
  Input: { "city": "Tokyo", "checkin": "2026-05-12", "checkout": "2026-05-19", "filters": [ "vegetarian_breakfast" ] }
  Output: { "results": 3 }
- Final: Answer with three options, citing the remembered dates and preference so the user can correct stale memory.
That last step carries a principle: surface what you recalled. Citing remembered facts back to the user ("for your May 12–19 Tokyo trip") turns stale memory from a silent failure into a correctable one.
The assembly function, line by line
Strip away the vendor SDKs and context assembly is one honest function. Step through how each stage maps to code:
```typescript
export async function assembleContext(
  session: Session,
  userMessage: string,
  budget: TokenBudget
): Promise<ModelInput> {
  // Collect: every candidate source, no filtering yet
  const candidates = {
    memories: await recallMemories(userMessage, budget.memory),
    retrieval: await maybeRetrieve(userMessage, session, budget.retrieval),
    history: session.turns,
  };

  // Select + compress history: recent turns verbatim, the rest summarized
  const recent = candidates.history.slice(-RECENT_TURNS);
  const summary = await rollingSummary(
    candidates.history.slice(0, -RECENT_TURNS),
    session.summary,
    budget.summary
  );

  // Order: stable blocks first (cacheable), live task last
  return render([
    block('system', SYSTEM_PROMPT),
    block('tools', toolDefinitions(session.workflow)),
    block('memory', candidates.memories, { labeled: true }),
    block('retrieval', candidates.retrieval, { labeled: true }),
    block('summary', summary),
    block('history', recent),
    block('user', userMessage),
  ]);
}
```
The budget arrives as an argument, not a constant buried in a helper. Every downstream call receives its slice, which makes the allocation testable and per-workflow tunable.
Zoom out one level and the same function sits inside a per-turn lifecycle:
The user message lands and the session record loads: turns, rolling summary, workflow scratchpad.
Nothing model-facing has happened yet — this stage is pure I/O and belongs in ordinary application code.
The contrast with the naive approach is stark enough to put side by side:
Naive: Concatenate the full transcript every turn. Retrieval fires always. Tool schemas ship complete. Works in the demo; by turn forty the window is noise, latency has doubled, and the agent forgets the user's first constraint.
Engineered: Fresh selection every turn against explicit budgets. Retrieval is policy. Old history is summarized, decisions live in a structured scratchpad, and the assembled window is logged so failures are inspectable.
Adopting this in an existing agent is incremental, not a rewrite:
Instrument the window you already build
Log per-source token counts for every turn. Most teams discover one source (usually history or tool schemas) eating over half the window for no measured benefit.
Set budgets and enforce them
Write down the allocation, then make the assembly code enforce it. The first enforcement run usually exposes the silent overruns.
Add a retrieval policy
Replace retrieve-always with the explicit policy decision: skip on chitchat, skip when the answer is in-window, scope by workflow.
Compress, then measure
Introduce rolling summaries and structured extraction, and verify with evals that answer quality held while tokens dropped. Compression you cannot measure is just deletion with extra steps.
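The first step, logging per-source token counts, needs very little code. A minimal sketch follows; it assumes the token counts per source are already available from whatever tokenizer the stack uses, and the `reportWindow` name and the 50% alert threshold are illustrative choices.

```typescript
type WindowReport = {
  total: number;
  shares: Record<string, number>; // fraction of the window per source
  dominant: string[]; // sources above the alert threshold
};

// Summarize one assembled window: total size, each source's share, and any
// source that takes more than alertShare of the whole.
function reportWindow(
  tokensBySource: Record<string, number>,
  alertShare = 0.5
): WindowReport {
  const sources = Object.keys(tokensBySource);
  const total = sources.reduce((sum, s) => sum + tokensBySource[s], 0);
  const shares: Record<string, number> = {};
  const dominant: string[] = [];
  for (const source of sources) {
    const share = total === 0 ? 0 : tokensBySource[source] / total;
    shares[source] = share;
    if (share > alertShare) dominant.push(source);
  }
  return { total, shares, dominant };
}

const report = reportWindow({ system: 500, tools: 1500, memory: 500, history: 4500 });
// One structured log line per turn is enough to start spotting patterns.
console.log(JSON.stringify(report));
```

In this made-up turn, history takes roughly 64% of a 7,000-token window, which is the kind of finding the instrumentation step is meant to surface before any budget is set.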
How this maps onto specific stacks differs mainly in vocabulary:
Sessions and compaction handle the history side out of the box; memory and retrieval arrive as tools you register. Your assembly logic lives in the system prompt plus tool design — the SDK owns ordering and caching.
Context assembly becomes a graph node ahead of the model node. State channels carry the session scratchpad; the checkpointer is your session store. The budget object travels in graph state.
You own every stage of the pipeline explicitly, which is more work and also the clearest way to learn it. The assembleContext() function above is the whole architecture.
And the failure modes to expect, because every team hits the same four:
Stale memory wins over the live user
A remembered preference contradicts what the user just said, and the model follows the memory. Fix: order recalled facts before recent turns, label them as fallible, and cite them in answers so users can correct them.
Retrieval noise outranks the conversation
Top-K returns plausible-but-irrelevant chunks and the agent answers from those instead of the user's actual question. Fix: similarity thresholds (not just top-K), re-ranking, and the skip policy.
Summary drift
The rolling summary paraphrases a critical value — an order ID, a date — and the corruption compounds every refresh. Fix: structured extraction for paraphrase-sensitive data; summaries carry narrative, never identifiers.
Cache-hostile assembly
Volatile content (timestamps, memory blocks) interleaved before stable blocks breaks prefix caching, multiplying cost and latency. Fix: the ordering rule — stable first, volatile last.
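The ordering rule from the last failure mode is easy to check mechanically. The sketch below assumes each block is tagged with how often its content changes; the three volatility levels and the function names are hypothetical, and how a given block should be classified is a decision for each stack.

```typescript
// Hypothetical volatility levels: how often a block's content changes.
type Volatility = "static" | "session" | "turn";
type OrderedBlock = { name: string; volatility: Volatility };

const RANK: Record<Volatility, number> = { static: 0, session: 1, turn: 2 };

// Return the index of the first block that sits after a more volatile block,
// or -1 if the order is already stable-to-volatile.
function firstCacheBreak(blocks: OrderedBlock[]): number {
  let maxRank = 0;
  for (let i = 0; i < blocks.length; i++) {
    const rank = RANK[blocks[i].volatility];
    if (rank < maxRank) return i;
    maxRank = Math.max(maxRank, rank);
  }
  return -1;
}

// Sort a copy by volatility. Array.prototype.sort is stable in ES2019 and
// later, so blocks with equal volatility keep their relative order.
function orderForCaching(blocks: OrderedBlock[]): OrderedBlock[] {
  return [...blocks].sort((a, b) => RANK[a.volatility] - RANK[b.volatility]);
}

const hostile: OrderedBlock[] = [
  { name: "system", volatility: "static" },
  { name: "timestamp", volatility: "turn" },
  { name: "tools", volatility: "static" },
  { name: "user", volatility: "turn" },
];
const breakIndex = firstCacheBreak(hostile);
const fixed = orderForCaching(hostile);
```

A check like `firstCacheBreak` fits naturally in a unit test over the assembly function, so a new block inserted in the wrong place fails the build instead of quietly raising cost.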
Check your understanding
An agent answers well early in sessions but degrades badly by turn thirty — long before the context window is full. What is the most likely cause?
Your rolling summary keeps corrupting order IDs that were stated earlier in the session. Which lever does this post recommend?
The discipline in one card
Context engineering is not a model trick; it is an engineering discipline with a pipeline, budgets, policies, and logs. The teams whose agents stay coherent at turn fifty are the ones treating the window as a system.
Context engineering in one card
- Budget per source and enforce the budget in code.
- Make retrieval a policy with an explicit skip path.
- Summaries carry narrative; structured scratchpads carry identifiers.
- Order stable-to-volatile: caching at the front, attention at the back.
- Log the assembled window — context bugs are invisible without it.
This post focused on a single agent's window. The moment you decompose into multiple agents, context becomes a routing problem too — who sees what, and what crosses the boundary between specialists. That story is the Agent Orchestration series, and the two disciplines compound: good decomposition bounds what each window must hold.
References
- Anthropic, Effective context engineering for AI agents
- Anthropic, Building Effective Agents
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts
- OpenAI, A Practical Guide to Building Agents