---
title: Context Engineering Is the New Prompt Engineering
type: Framework
date: 2026-04-04
excerpt: The shift from "write better prompts" to "design better context" — and why this reframe changes everything about how you build with LLMs.
---
Context Engineering Is the New Prompt Engineering
The standard mental model for working with LLMs goes like this: write a good prompt, and the model will do the right thing.
This is incomplete. The prompt is just one ingredient in the context window. Everything the model sees — the instructions, the examples, the retrieved documents, the conversation history, the tool outputs — shapes what it produces.
If you optimize only the prompt, you're ignoring 80% of what shapes the model's behavior.
The reframe: context engineering — designing the entire context window strategically, not just writing better prompts.
What Context Engineering Means
Context engineering is the practice of deliberately structuring what goes into the model's context window to maximize quality, consistency, and efficiency.
It includes:
- Instruction placement. Where does the system prompt go? Before or after examples? At the start or end of the context?
- Example selection. Which few-shot examples do you include? How many? In what order?
- Retrieval quality. If you're using RAG (Retrieval Augmented Generation), which documents do you pull? Are they relevant? Are there too many?
- History compression. If this is a multi-turn conversation, how much history do you include? Do you summarize or truncate?
- Output scaffolding. Do you ask the model to think step-by-step? Do you provide a template for the response format?
All of these decisions influence what the model outputs. Most teams optimize one or two and ignore the rest.
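To make this concrete, here's a minimal sketch of what treating the context window as a designed artifact looks like in code. The names (ContextSpec, assemble) are illustrative, not a real library, and token budgeting is omitted for brevity:

```python
from dataclasses import dataclass, field

@dataclass
class ContextSpec:
    """The five components of a context window, held as explicit fields
    so each one can be budgeted, swapped, and reordered deliberately."""
    instructions: str = ""
    examples: list[str] = field(default_factory=list)
    retrieved_docs: list[str] = field(default_factory=list)
    history: str = ""
    output_scaffold: str = ""

def assemble(spec: ContextSpec) -> str:
    # Deliberate ordering: documents and history first, instructions and
    # scaffold last, so they benefit from recency (see "Instruction
    # Placement" below).
    parts = spec.retrieved_docs + [spec.history] + spec.examples
    parts += [spec.instructions, spec.output_scaffold]
    return "\n\n".join(p for p in parts if p)
```

Once the components are explicit like this, each lever below becomes a change to one field rather than a rewrite of one giant prompt string.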
Why Prompt Engineering Alone Hits a Ceiling
You can make the prompt crystal clear and perfectly worded. But if your context window is full of irrelevant information, the model can't focus.
You can write examples into the prompt. But if you chose the wrong examples, they're teaching the model the wrong behavior.
You can add instructions. But if you've already used half the context window on RAG results, the instructions get lost in the noise.
This is the ceiling: a well-written prompt in a poorly designed context window still underperforms.
The teams I've seen with the best LLM systems didn't necessarily have the most creative prompts. They had carefully engineered context windows.
The 5 Levers of Context Design
Here are the main levers you can pull:
1. Instruction Placement
Where you put instructions in the context matters.
At the start. Traditional. The model reads instructions first. Problem: by the time it has worked through the examples and retrieved context, the instructions are far behind and carry less weight.
At the end. Less common, but it benefits from recency bias: the last thing the model reads before generating is the instruction. This often works better.
Interspersed. Instructions mixed with examples. "Here's an instruction. Here's an example of what that looks like. Here's another instruction."
Bracketed. Put instructions in a special format: <instruction>Do X</instruction>. This helps the model parse them as separate from content.
A simple test: put your system prompt at the end of the context instead of the start. Measure quality. Many systems see 10-20% improvements just from reordering.
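A hedged sketch of that test, where `call_model` (your inference call) and `score` (your quality metric) are placeholders to fill in:

```python
def build_context(instructions: str, body: str, place: str) -> str:
    """Same content, two orderings: instructions first or instructions last."""
    return f"{instructions}\n\n{body}" if place == "start" else f"{body}\n\n{instructions}"

def ab_test_placement(queries, instructions, body, call_model, score):
    """Average quality per ordering over a fixed query set.
    call_model(context, query) -> answer and score(answer) -> float
    are stand-ins for your own inference call and metric."""
    return {
        place: sum(score(call_model(build_context(instructions, body, place), q))
                   for q in queries) / len(queries)
        for place in ("start", "end")
    }
```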
2. Example Selection
Few-shot examples are powerful. But which examples?
Quality over quantity. One perfect example beats ten mediocre ones. Most teams include too many examples.
Diversity. Examples should cover the range of things you care about. If all your examples are happy-path cases, the model learns to ignore edge cases.
Similarity. Include examples that are similar to the query you're actually trying to answer. If you're classifying emails and your query is about spam, include spam classification examples.
Order. Primacy and recency effects exist. The first example is remembered strongly. The last example is remembered strongly. Middle examples blur together. Put your most important examples first and last.
Hardness progression. Start with easy examples, progress to harder ones. This seems to help the model learn better.
Example engineering is underrated. I've seen teams improve quality 15-30% just by changing which examples they include, without touching the prompt.
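A minimal sketch of similarity-plus-ordering selection. The lexical-overlap similarity is a deliberately crude stand-in (in practice you'd use embedding cosine similarity), and the pool format of {"input", "output"} dicts is an assumption:

```python
def similarity(a: str, b: str) -> float:
    """Crude lexical overlap; swap in embedding cosine similarity in practice."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def select_examples(query: str, pool: list[dict], k: int = 3) -> list[dict]:
    """Pick the k pool examples most similar to the query, then reorder so
    the two strongest matches sit first and last (primacy and recency)."""
    ranked = sorted(pool, key=lambda ex: similarity(query, ex["input"]), reverse=True)
    top = ranked[:k]
    if len(top) >= 3:
        top = [top[0]] + top[2:] + [top[1]]  # best first, second-best last
    return top
```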
3. Retrieval Quality
If you're using RAG, retrieval is the biggest lever.
Relevance. The retrieved documents should be relevant to the query. A retriever that pulls random documents is worse than useless — it gives the model bad information.
Quantity. How many documents do you retrieve? 1? 10? 100? There's a sweet spot. Usually 3-5 high-quality documents beats 20 mediocre ones. The model gets distracted by volume.
Freshness. Stale documents cause confident errors: the model trusts whatever a document says, even when its contents are outdated.
De-duplication. If multiple retrieved documents say the same thing, you're wasting context. Dedupe them.
Ranking. The order you present documents matters. Put the most relevant first. Models are biased toward earlier content.
Spending engineering effort on retrieval quality pays dividends. A 10% improvement in retrieval quality often means a 10-20% improvement in model quality.
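Here's a sketch of the post-retrieval cleanup this implies: rank, dedupe, cap. The overlap heuristic is again a crude stand-in for a proper similarity measure:

```python
def _overlap(a: str, b: str) -> float:
    """Crude lexical similarity; use embeddings or MinHash in production."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def postprocess(docs: list[tuple[float, str]], k: int = 5, dup: float = 0.8) -> list[str]:
    """docs is (retriever_score, text). Sort by relevance, drop near-duplicates,
    cap at k, and return most-relevant-first."""
    kept: list[str] = []
    for _, text in sorted(docs, reverse=True):
        if any(_overlap(text, seen) > dup for seen in kept):
            continue  # near-duplicate of a doc we already kept
        kept.append(text)
        if len(kept) == k:
            break
    return kept
```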
4. History Compression
In multi-turn conversations, context grows. At some point, you run out of space.
Truncation. Just keep the last N turns. Simple but lossy. The model forgets early context.
Summarization. Use a separate model to summarize the conversation history. "The user asked about X, we discussed Y, they chose option Z." Then include the summary instead of the full history. Saves tokens, preserves meaning.
Hierarchical. Keep recent turns in full, compress older turns. Last 10 turns in full, turns 11-30 as summary, turns 31+ as single line.
Selective. Keep only the turns that are relevant to the current query. "The user mentioned they use Python" is relevant to a coding question. "The user said the weather is nice" is not.
Most systems do basic truncation. They'd see quality improvements by adding summarization.
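A sketch of the hierarchical approach, with `summarize` as a placeholder for a call to a cheap summarization model:

```python
def compress_history(turns: list[str], summarize, recent: int = 10, mid: int = 20) -> str:
    """Hierarchical compression: last `recent` turns verbatim, the `mid` turns
    before that as a short summary, anything older squeezed to a single line."""
    cut_recent = max(0, len(turns) - recent)
    cut_mid = max(0, cut_recent - mid)
    parts = []
    if turns[:cut_mid]:  # oldest turns: hard-truncate to one line's worth
        parts.append("Earlier: " + summarize(" ".join(turns[:cut_mid]))[:200])
    if turns[cut_mid:cut_recent]:  # middle turns: short summary
        parts.append("Summary: " + summarize(" ".join(turns[cut_mid:cut_recent])))
    parts.extend(turns[cut_recent:])  # recent turns: verbatim
    return "\n".join(parts)
```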
5. Output Scaffolding
How you ask for output shapes what you get.
Chain-of-thought. Ask the model to think step-by-step. "Before answering, think through your reasoning." This often helps with reasoning tasks.
Templates. Provide a format: "Answer in this format: {reason: ..., conclusion: ..., confidence: ...}" The model is more likely to follow the format.
Constraints. "Answer in one sentence." "Cite your sources." "Explain in terms a 10-year-old would understand."
Token budget. "Keep your answer under 100 tokens." This is both a constraint and a performance optimization (fewer output tokens mean lower cost and latency).
Output scaffolding is cheap and often high-impact: it spends a little context up front but saves a lot of time fixing malformed outputs downstream.
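These pieces compose. A sketch of a scaffolded prompt that combines chain-of-thought, a template, and constraints (the exact wording is illustrative, not canonical):

```python
def scaffold(question: str) -> str:
    """Wrap a question with chain-of-thought, a response template,
    and constraints."""
    return (
        f"{question}\n\n"
        "Before answering, think through your reasoning step by step.\n"
        "Then answer in exactly this format:\n"
        "Reason: <one sentence>\n"
        "Conclusion: <one sentence>\n"
        "Confidence: <low|medium|high>\n"
        "Keep the whole answer under 100 tokens and cite your sources."
    )
```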
Practical Examples from Production Systems
Example 1: Customer Support Chatbot
Before context engineering:
- Large system prompt (500 tokens): "You are a helpful customer support agent. Answer questions about our products. Be polite. Try to resolve issues. If you can't, escalate to a human."
- Large context window (3000 tokens): customer history, product docs, FAQ
- Model outputs: long-winded, sometimes incorrect, inconsistent
After context engineering:
- Concise system prompt (100 tokens): "Answer this question about our products. Be brief."
- Carefully selected context (1500 tokens): only relevant product docs, no history
- Few-shot examples (300 tokens): 3 good examples of questions + answers
- Output template (100 tokens): "Answer: [answer]. Sources: [sources]."
- Model outputs: short, correct, consistent
Result: 20% quality improvement, 40% cost reduction (fewer tokens in context), 30% latency improvement.
Example 2: Code Generation
Before:
- Prompt: "Write Python code to do X"
- Context: full repo history, lots of examples
- Output: code that works on simple cases, fails on edge cases
After:
- Prompt: "Write Python code that..."
- Context: 3 carefully chosen examples of similar code, style guide
- Output scaffolding: "Code:\n``
python\n...\n``\nTests:\n...\nExplain:..." - Model outputs: more robust code, includes tests, explains reasoning
Result: 25% reduction in bugs found by users.
Example 3: Research Summary
Before:
- Prompt: "Summarize this research paper"
- Context: full paper (10,000+ tokens)
- Output: incoherent summary, misses key findings
After:
- Prompt: "Summarize..."
- Context: abstract (500 tokens), introduction (500 tokens), conclusions (500 tokens)
- Examples: 2 examples of good summaries
- Output template: "Key findings: ..., Implications: ..., Limitations: ..."
- Model outputs: clear, accurate, structured
Result: 50% better summaries (measured by human rating).
Why Context Engineering Changes Everything
Once you start thinking about context engineering, you realize:
- Prompt quality matters less than context quality. A mediocre prompt in a well-designed context beats a brilliant prompt in a chaotic context.
- You have massive leverage. Changing the prompt might improve quality 5%. Reordering context might improve it 20%. Improving retrieval might improve it 30%. The context window is where the leverage lives.
- Optimization compounds. Each lever (instruction placement, examples, retrieval, history, output scaffolding) is largely independent. Improve one, then improve the next. They stack.
- You need measurement. You can't optimize context without measuring impact. Run A/B tests on context changes the same way you'd test code changes.
- Scaling works. A well-engineered context window scales. It works for 100 queries or 100M queries. A prompt hack scales only until it doesn't.
This is why the best LLM systems are not built by people who are great at writing prompts. They're built by people who systematically engineer every aspect of the context window.
Getting Started
If you want to improve your LLM system, start here:
- Map your context window. What goes in? Instructions, examples, retrieved data, history, output scaffolding? Measure how much space each takes (see the sketch after this list).
- Measure baseline quality. Test the system on a representative set of queries. Measure quality, cost, latency.
- Pick one lever. Start with retrieval (if you use RAG) or example selection. These usually have the highest ROI.
- Experiment. Change that lever. Measure impact.
- Iterate. Once that's optimized, pick the next lever.
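For step 1, a sketch of a context map. The 4-characters-per-token estimate is a rough heuristic; swap in a real tokenizer (e.g. tiktoken) for exact counts, and the component strings below are placeholders:

```python
def map_context(components: dict[str, str]) -> None:
    """Print roughly how much of the window each component consumes,
    largest first, using a ~4-chars-per-token approximation."""
    est = {name: len(text) // 4 for name, text in components.items()}
    total = sum(est.values()) or 1
    for name, tokens in sorted(est.items(), key=lambda kv: -kv[1]):
        print(f"{name:>12}: ~{tokens:5d} tokens ({100 * tokens // total}%)")

map_context({
    "instructions": "You are a support agent...",
    "examples": "Q: ...\nA: ...\n" * 3,
    "retrieval": "doc1 ... doc2 ...",
    "history": "user: hi\nassistant: hello",
    "scaffold": "Answer: [answer]. Sources: [sources].",
})
```

A map like this usually surprises people: the component eating most of the window is rarely the one they've been optimizing.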
You don't need to be clever about prompts. You need to be systematic about context. That's the edge.