Context engineering: assembling what your agent sees

Every turn, before your model generates a single token, something decides what the model gets to see. The system prompt, the conversation so far, retrieved memories, tool definitions, the user's latest message — all of it competes for a finite window, and the quality of that assembly puts a hard ceiling on the quality of everything downstream. That assembly discipline is context engineering, and most production agent failures we debug trace back to it, not to the model.

This post walks the whole pipeline: where context comes from, how retrieval decides what gets in, how token budgets force trade-offs, and what the assembly code actually looks like. It is also, deliberately, a showcase — it exercises every interactive primitive in our blog system, including the new scroll-driven walkthroughs, code walkthroughs, glossary terms, and comprehension checks. Expect a higher component density than a normal post; that is the point of this one.

Key takeaways

Context is assembled, not accumulated. Every turn is a fresh selection from sources that always exceed the window.
The pipeline has five stages — collect, select, compress, order, render — and each one is a place quality is won or lost.
Retrieval is a policy decision, not a default: skipping retrieval is often the right call.
Token budgets are engineering budgets. Allocate per source, enforce the allocation, and measure what each slice buys you.
Attention dilution degrades agents long before the window is full — relevance beats volume.

01 THE PREMISE

Prompts are static; context is a system

"Prompt engineering" suggests a craft of wording. But in a production agent the system prompt is the only part of the input that stands still. Everything else (history, memories, retrieved documents, tool results) is selected at runtime, per turn, by code you wrote. The context window is a budgeted surface, and something in your stack is already doing context engineering. The only question is whether it does it deliberately or by accident.

Context engineering is the delicate art and science of filling the context window with just the right information for the next step.
Andrej Karpathy — endorsing the term, June 2025

The failure mode that makes this concrete: an agent that answers perfectly in turn three and confidently contradicts itself in turn thirty. The window was never full. What happened is attention dilution — every irrelevant transcript fragment competes with the relevant ones, and the model's effective precision drops as the noise grows. Relevance beats volume, every time.

02 THE PIPELINE

The context assembly pipeline

Context assembly is a pipeline, and naming its stages gives you places to put instrumentation, budgets, and tests. The five stages are collect, select, compress, order, and render.

To see the pipeline run, follow one turn of a support agent as its context window gets built, in four states:

Everything competes

The collect stage produces more candidates than the window holds: forty turns of history, a dozen plausible memories, twenty retrieval hits, every tool definition. If you concatenate and truncate, the cut is arbitrary — whatever happened to be last loses.

Select what earns a slot

Selection applies explicit rules: the last N turns of history survive verbatim, memories must clear a similarity threshold, retrieval must be triggered by policy (more on that below). Everything else is rejected before compression — you can't summarize your way out of having selected the wrong things.
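The selection rules above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `ChatTurn` and `ScoredMemory` shapes, the function name, and the parameter names are all hypothetical, and the threshold and turn count are whatever your workflow calls for.

```typescript
// Hypothetical shapes for illustration; not from any specific SDK.
interface ChatTurn {
  role: "user" | "assistant";
  text: string;
}

interface ScoredMemory {
  text: string;
  score: number; // similarity in [0, 1]
}

interface SelectedContext {
  recentTurns: ChatTurn[]; // kept verbatim
  olderTurns: ChatTurn[]; // handed to the compress stage, never sent raw
  memories: ScoredMemory[]; // only those that clear the threshold
}

function selectCandidates(
  turns: ChatTurn[],
  memories: ScoredMemory[],
  recentCount: number,
  minSimilarity: number
): SelectedContext {
  // Compute the cut index explicitly: slice(-0) would return the whole array.
  const cut =
    recentCount > 0 ? Math.max(turns.length - recentCount, 0) : turns.length;
  return {
    recentTurns: turns.slice(cut),
    olderTurns: turns.slice(0, cut),
    // filter() returns a new array, so sorting does not mutate the input.
    memories: memories
      .filter((m) => m.score >= minSimilarity)
      .sort((a, b) => b.score - a.score),
  };
}
```

The point of returning `olderTurns` separately is that selection and compression stay distinct steps: anything not selected verbatim is either summarized later or dropped, never silently truncated.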

Compress the survivors

Old history collapses into a rolling summary; retrieval keeps top-K with the rest dropped, not truncated mid-sentence; verbose tool schemas get trimmed to the fields this workflow uses. Compression is lossy by design — the budget decides how lossy.

Order and render

Order is the cheapest optimization in the pipeline: stable blocks lead so prompt caching works; the live task lands at the end, where models attend most reliably. Render with explicit section markers — a model that can't tell a retrieved memory from a user instruction will eventually follow the wrong one.
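As a rough sketch of the ordering and rendering rule, assuming a hypothetical `ContextBlock` shape and an XML-like marker format (the marker syntax is an illustrative choice, not a standard):

```typescript
// Hypothetical block shape; the marker format below is an assumption.
interface ContextBlock {
  label: string; // e.g. "system", "memory", "user"
  body: string;
  volatile: boolean; // true if the content changes from turn to turn
}

function renderBlocks(blocks: ContextBlock[]): string {
  // Partition rather than sort: stable blocks lead, volatile blocks follow,
  // and the original order is preserved inside each group.
  const stable = blocks.filter((b) => !b.volatile);
  const changing = blocks.filter((b) => b.volatile);
  return stable
    .concat(changing)
    .map((b) => `<${b.label}>\n${b.body}\n</${b.label}>`)
    .join("\n\n");
}
```

Because each block is wrapped in its own labeled section, a recalled memory and a user instruction never share an undifferentiated run of text. Placing the user message last among the volatile blocks is the caller's job in this sketch.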

03 MEMORY TOPOLOGY

Where memories actually live

"Give the agent memory" is four different storage systems wearing one name. The shape that works in production separates them by lifetime and access pattern.

A production memory topology: the agent core reads from three stores with different lifetimes; only the session store is written every turn.

The distinction that matters most is write policy. Session state is written unconditionally; vector memory is written through a promotion gate ("is this worth remembering across sessions?"); the knowledge base is not written by the agent at all. Collapsing those three write policies into one store is how agents end up confidently recalling things that were never true.
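One way to keep the three write policies from collapsing into one is to route every write through a single decision function. The sketch below assumes a hypothetical `promotionScore` produced by some upstream check and an illustrative threshold; neither comes from a particular library.

```typescript
type MemoryStoreName = "session" | "vector" | "knowledgeBase";

interface WriteRequest {
  store: MemoryStoreName;
  text: string;
  // Hypothetical score from a promotion check
  // ("is this worth remembering across sessions?").
  promotionScore?: number;
}

type WriteDecision = { allowed: true } | { allowed: false; reason: string };

const PROMOTION_THRESHOLD = 0.8; // illustrative value, tune per workflow

function decideWrite(req: WriteRequest): WriteDecision {
  switch (req.store) {
    case "session":
      // Session state is written unconditionally.
      return { allowed: true };
    case "vector":
      // Vector memory is written only through the promotion gate.
      return (req.promotionScore ?? 0) >= PROMOTION_THRESHOLD
        ? { allowed: true }
        : { allowed: false, reason: "below promotion threshold" };
    case "knowledgeBase":
      // The agent never writes to the knowledge base.
      return { allowed: false, reason: "knowledge base is read-only for the agent" };
  }
}
```

Keeping the policy in one function also gives you one place to log rejected writes, which is where "memories that were never true" usually show up first.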

04 RETRIEVAL

Retrieval is a ranking problem and a policy problem

Recall from vector memory works by similarity search: the query is embedded, candidate memories are ranked by distance, and the top-K survive. For one example query, the ranking looks like this:

Top 3 retrieved

Memory	Similarity
Booked Tokyo flights in May	0.95
Based in Denver	0.94
Speaks basic Japanese	0.91
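The ranking step itself is small. Here is a minimal sketch using cosine similarity over in-memory vectors; the `EmbeddedMemory` shape and function names are hypothetical, and a real vector store would typically do this with an index rather than a linear scan.

```typescript
interface EmbeddedMemory {
  text: string;
  vector: number[];
}

// Cosine similarity; assumes both vectors have the same length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every candidate against the query, sort descending, keep the top k.
function rankTopK(query: number[], memories: EmbeddedMemory[], k: number) {
  return memories
    .map((m) => ({ text: m.text, score: cosineSimilarity(query, m.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Note what this does not do: it always returns k results, however weak. That is why a similarity threshold and a retrieve-or-skip policy have to sit on top of it.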

Similarity ranks candidates, but it cannot decide whether to retrieve at all. That is a policy question, and encoding it explicitly beats retrieving on every turn. The decision takes three inputs:

Retrieve, or answer from context?

Query type
Answer already in recent context?
Optimization mode

Example outcome: SKIP. Social turns gain nothing from retrieval; it adds latency and noise.
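A sketch of that policy as a function of the three inputs listed above. Only the social-turn rule comes from the text; the category names, the mode names, and the quality-mode branch are assumptions made for illustration.

```typescript
type QueryKind = "social" | "factual" | "task"; // assumed categories
type OptimizationMode = "latency" | "quality"; // assumed modes

interface RetrievalDecision {
  retrieve: boolean;
  reason: string;
}

function decideRetrieval(
  kind: QueryKind,
  answerInRecentContext: boolean,
  mode: OptimizationMode
): RetrievalDecision {
  if (kind === "social") {
    return { retrieve: false, reason: "social turn: retrieval adds latency and noise" };
  }
  if (answerInRecentContext && mode === "latency") {
    return { retrieve: false, reason: "answer already in window" };
  }
  if (answerInRecentContext) {
    // Assumption: in quality mode, retrieve anyway to cross-check the window.
    return { retrieve: true, reason: "quality mode: cross-check the window" };
  }
  return { retrieve: true, reason: "answer not in window" };
}
```

Returning a reason alongside the boolean makes the policy loggable, so a skipped retrieval that hurt an answer can be traced to the rule that caused it.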

The query side of this is a few lines. The discipline is in the threshold and the budget, not the API call:

recall.ts

export async function recallMemories(
  query: string,
  budgetTokens: number
): Promise<MemoryChunk[]> {
  const embedded = await embed(query);
  const hits = await vectorStore.search(embedded, {
    topK: 8,
    minSimilarity: 0.75, // below this, "similar" is coincidence
  });
  return packToBudget(hits, budgetTokens); // whole chunks only, never split
}
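The `packToBudget` helper is not shown in the snippet above. One plausible shape for it, as a sketch: the `MemoryChunk` fields and the four-characters-per-token estimate are assumptions, and a real tokenizer should replace the estimate.

```typescript
// Assumed shape; the original snippet does not define MemoryChunk.
interface MemoryChunk {
  text: string;
  score: number;
}

// Rough heuristic: about four characters per token. Swap in a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function packToBudget(hits: MemoryChunk[], budgetTokens: number): MemoryChunk[] {
  const packed: MemoryChunk[] = [];
  let used = 0;
  // Highest-scoring chunks get first claim on the budget.
  const byScore = hits.slice().sort((a, b) => b.score - a.score);
  for (const hit of byScore) {
    const cost = estimateTokens(hit.text);
    if (used + cost > budgetTokens) continue; // skip the whole chunk, never split it
    packed.push(hit);
    used += cost;
  }
  return packed;
}
```

This greedy version keeps scanning after a chunk fails to fit, so a small lower-ranked chunk can still use leftover budget. Stopping at the first miss is the stricter alternative if rank order matters more than fill.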

05 TOKEN BUDGETS

The window is a budget — treat it like one

Allocate the input window per source, and enforce the allocation in code. Here is a working budget for an 8K-input agent:

Input budget for one turn (8K target)

Source	Tokens
System prompt	1200
Tool definitions	900
Memories + retrieval	1800
History (summary + recent)	2600
Current message + headroom	800

Total: 7300 / 8000 tokens
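Enforcing the allocation can start as a check that runs before rendering. The sketch below hard-codes the budget from this section; the type and function names are hypothetical, and measuring the actual token counts is left to the caller.

```typescript
type BudgetSource = "system" | "tools" | "memory" | "history" | "current";

// The per-source allocation from the budget above.
const TURN_BUDGET: Record<BudgetSource, number> = {
  system: 1200,
  tools: 900,
  memory: 1800,
  history: 2600,
  current: 800,
};

const WINDOW_TARGET = 8000;

interface BudgetReport {
  total: number;
  overruns: { source: BudgetSource; over: number }[];
  withinTarget: boolean;
}

// Compare measured token counts against the allocation and the overall target.
function checkBudget(actual: Record<BudgetSource, number>): BudgetReport {
  const sources = Object.keys(TURN_BUDGET) as BudgetSource[];
  const total = sources.reduce((sum, s) => sum + actual[s], 0);
  const overruns = sources
    .filter((s) => actual[s] > TURN_BUDGET[s])
    .map((s) => ({ source: s, over: actual[s] - TURN_BUDGET[s] }));
  return { total, overruns, withinTarget: total <= WINDOW_TARGET };
}
```

For example, a turn whose history slice grows to 3500 tokens while everything else stays on budget totals 8200: one overrun of 900 on `history`, and the turn is over the 8000 target.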

When the budget forces compression, you have four levers, each with a different fidelity/cost profile:

Lever	Fidelity	Added cost	Use when
Truncation	Lowest	None	Prototypes only
Rolling summary	Good for narrative	One LLM call per N turns	Long sessions
Structured extraction	Exact for what it captures	Schema design + extraction call	Slot-heavy workflows
Relevance filtering	High, query-dependent	Embedding + scoring pass	Heavy windows, strict budgets

06 MEMORY IN ACTION

What the reader feels: memory on vs. off

The difference context engineering makes is easiest to see from the user's side of the chat: same agent, same questions, once with memory on and once with it off.

[Interactive demo: travel assistant, two sessions after booking, three turns, memory on vs. off]

Behind the memory-on script, one turn of the agent looks like this — note that recall is a tool call with a budget, not an ambient superpower:

Trace

"Can you find me a hotel for that trip?"

Thought
"That trip" is unresolved. Check session summary, then long-term memory before asking the user.

Tool call: memory.search (142 ms)

Recall trip-related memories for this user

Input

{
  "query": "upcoming trip booking",
  "top_k": 3,
  "min_similarity": 0.75
}

Output

{
  "hits": [
    {
      "text": "Booked PDX→NRT flights, May 12–19",
      "score": 0.91
    },
    {
      "text": "Vegetarian",
      "score": 0.82
    },
    {
      "text": "Prefers window seats",
      "score": 0.78
    }
  ]
}

Observation
Trip resolved: Tokyo, May 12–19. Dietary preference is relevant to hotel choice; seat preference is not — include the first two, drop the third.

Tool call: hotels.search (890 ms)

Search hotels with the resolved constraints

Input

{
  "city": "Tokyo",
  "checkin": "2026-05-12",
  "checkout": "2026-05-19",
  "filters": [
    "vegetarian_breakfast"
  ]
}

Output

{
  "results": 3
}

Final
Answer with three options, citing the remembered dates and preference so the user can correct stale memory.

That last step carries a principle: surface what you recalled. Citing remembered facts back to the user ("for your May 12–19 Tokyo trip") turns stale memory from a silent failure into a correctable one.
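In code, surfacing recalled facts can be as simple as building a short preamble from whatever memories actually shaped the answer. This is a sketch only: the `RecalledFact` shape and the wording of the preamble are illustrative assumptions.

```typescript
interface RecalledFact {
  text: string;
  usedInAnswer: boolean; // set by the agent when a memory influenced the reply
}

// Build a preamble that names the remembered facts so the user can correct them.
function citeRecalled(facts: RecalledFact[]): string {
  const used = facts.filter((f) => f.usedInAnswer).map((f) => f.text);
  if (used.length === 0) return "";
  return (
    "Based on what I have on file (" +
    used.join("; ") +
    "). Tell me if any of this is out of date."
  );
}
```

Facts that were recalled but not used, like the seat preference in the trace above, stay out of the preamble, which keeps the citation short enough that users actually read it.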

07 IMPLEMENTATION

The assembly function, line by line

Strip away the vendor SDKs and context assembly is one honest function. Step through how each stage maps to code:

assembleContext.ts

export async function assembleContext(
  session: Session,
  userMessage: string,
  budget: TokenBudget
): Promise<ModelInput> {
  // Collect: every candidate source, no filtering yet
  const candidates = {
    memories: await recallMemories(userMessage, budget.memory),
    retrieval: await maybeRetrieve(userMessage, session, budget.retrieval),
    history: session.turns,
  };

  // Select + compress history: recent turns verbatim, the rest summarized
  const recent = candidates.history.slice(-RECENT_TURNS);
  const summary = await rollingSummary(
    candidates.history.slice(0, -RECENT_TURNS),
    session.summary,
    budget.summary
  );

  // Order: stable blocks first (cacheable), live task last
  return render([
    block('system', SYSTEM_PROMPT),
    block('tools', toolDefinitions(session.workflow)),
    block('memory', candidates.memories, { labeled: true }),
    block('retrieval', candidates.retrieval, { labeled: true }),
    block('summary', summary),
    block('history', recent),
    block('user', userMessage),
  ]);
}

The budget arrives as an argument, not a constant buried in a helper. Every downstream call receives its slice, which makes the allocation testable and per-workflow tunable.

Zoom out one level and the same function sits inside a per-turn lifecycle:

The user message lands and the session record loads: turns, rolling summary, workflow scratchpad.

Nothing model-facing has happened yet — this stage is pure I/O and belongs in ordinary application code.

The contrast with the naive approach is stark enough to put side by side:

Naive: accumulate

Concatenate the full transcript every turn. Retrieval fires always. Tool schemas ship complete. Works in the demo; by turn forty the window is noise, latency has doubled, and the agent forgets the user's first constraint.

Engineered: assemble

Fresh selection every turn against explicit budgets. Retrieval is policy. Old history is summarized, decisions live in a structured scratchpad, and the assembled window is logged so failures are inspectable.

Adopting this in an existing agent is incremental, not a rewrite:

Instrument the window you already build
Log per-source token counts for every turn. Most teams discover one source (usually history or tool schemas) eating over half the window for no measured benefit.
Set budgets and enforce them
Write down the allocation, then make the assembly code enforce it. The first enforcement run usually exposes the silent overruns.
Add a retrieval policy
Replace retrieve-always with the explicit policy decision: skip on chitchat, skip when the answer is in-window, scope by workflow.
Compress, then measure
Introduce rolling summaries and structured extraction, and verify with evals that answer quality held while tokens dropped. Compression you cannot measure is just deletion with extra steps.
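The first step in the list above, instrumenting the window, needs very little code. A minimal sketch, assuming you can already count tokens per source; the names and the 50% threshold are illustrative.

```typescript
interface SourceUsage {
  source: string;
  tokens: number;
  share: number; // fraction of the assembled window, 0..1
}

// Turn raw per-source token counts into a sorted usage report.
function summarizeWindow(tokensBySource: Record<string, number>): SourceUsage[] {
  const sources = Object.keys(tokensBySource);
  const total = sources.reduce((sum, s) => sum + tokensBySource[s], 0);
  return sources
    .map((source) => ({
      source,
      tokens: tokensBySource[source],
      share: total === 0 ? 0 : tokensBySource[source] / total,
    }))
    .sort((a, b) => b.tokens - a.tokens);
}

// Flag any source that takes more than the given share of the window.
function dominantSources(usage: SourceUsage[], threshold = 0.5): string[] {
  return usage.filter((u) => u.share > threshold).map((u) => u.source);
}
```

Logging the output of `summarizeWindow` on every turn gives you the baseline that the later steps (budgets, retrieval policy, compression) are measured against.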

How this maps onto specific stacks differs mainly in vocabulary:

Sessions and compaction handle the history side out of the box; memory and retrieval arrive as tools you register. Your assembly logic lives in the system prompt plus tool design — the SDK owns ordering and caching.

You own every stage of the pipeline explicitly — which is more work and also the clearest way to learn it. The assembleContext() function above is the whole architecture.

And the failure modes to expect, because every team hits the same four:

Stale memory wins over the live user

A remembered preference contradicts what the user just said, and the model follows the memory. Fix: order recalled facts before recent turns, label them as fallible, and cite them in answers so users can correct them.

Retrieval noise outranks the conversation

Top-K returns plausible-but-irrelevant chunks and the agent answers from those instead of the user's actual question. Fix: similarity thresholds (not just top-K), re-ranking, and the skip policy.

Summary drift

The rolling summary paraphrases a critical value — an order ID, a date — and the corruption compounds every refresh. Fix: structured extraction for paraphrase-sensitive data; summaries carry narrative, never identifiers.

Cache-hostile assembly

Volatile content (timestamps, memory blocks) interleaved before stable blocks breaks prefix caching, multiplying cost and latency. Fix: the ordering rule — stable first, volatile last.
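For the summary-drift fix in particular, the idea is to copy identifiers verbatim into a structured scratchpad so no summarizer ever gets the chance to paraphrase them. The sketch below uses regular expressions for brevity; the `ORD-` ID format and the ISO date format are assumptions, and a production system might use a schema-driven extraction call instead.

```typescript
interface Scratchpad {
  orderIds: string[];
  dates: string[];
}

// Assumed formats for illustration: "ORD-12345" and ISO dates like "2026-05-12".
const ORDER_ID_PATTERN = /\bORD-\d+\b/g;
const ISO_DATE_PATTERN = /\b\d{4}-\d{2}-\d{2}\b/g;

// Append newly found values, skipping ones already recorded.
function addUnique(existing: string[], found: string[] | null): string[] {
  const merged = existing.slice();
  for (const value of found || []) {
    if (merged.indexOf(value) === -1) merged.push(value);
  }
  return merged;
}

// Returns a new scratchpad; identifiers are stored exactly as the user wrote them.
function extractIdentifiers(turnText: string, pad: Scratchpad): Scratchpad {
  return {
    orderIds: addUnique(pad.orderIds, turnText.match(ORDER_ID_PATTERN)),
    dates: addUnique(pad.dates, turnText.match(ISO_DATE_PATTERN)),
  };
}
```

The scratchpad is then rendered as its own labeled block, and the rolling summary is free to carry narrative without being the system of record for anything exact.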

08 CHECK YOURSELF

Check your understanding

An agent answers well early in sessions but degrades badly by turn thirty — long before the context window is full. What is the most likely cause?

Your rolling summary keeps corrupting order IDs that were stated earlier in the session. Which lever does this post recommend?

09 TAKEAWAYS

The discipline in one card

Context engineering is not a model trick; it is an engineering discipline with a pipeline, budgets, policies, and logs. The teams whose agents stay coherent at turn fifty are the ones treating the window as a system.

Context engineering in one card

Budget per source and enforce the budget in code.
Make retrieval a policy with an explicit skip path.
Summaries carry narrative; structured scratchpads carry identifiers.
Order stable-to-volatile: caching at the front, attention at the back.
Log the assembled window — context bugs are invisible without it.

This post focused on a single agent's window. The moment you decompose into multiple agents, context becomes a routing problem too — who sees what, and what crosses the boundary between specialists. That story is the Agent Orchestration series, and the two disciplines compound: good decomposition bounds what each window must hold.

References

Anthropic, Effective context engineering for AI agents
Anthropic, Building Effective Agents
Liu et al., Lost in the Middle: How Language Models Use Long Contexts
OpenAI, A Practical Guide to Building Agents