The Context Engineering Stack

Someone at my talk asked: "How do you actually make effective use of all the context?" It's the right question. Context is the unlock — I've been saying this since Code Owns Truth. The model isn't the differentiator. The context you give it is.

But I've been learning, uncomfortably, that my own practice doesn't fully match the thesis.

#What I Built

At Product.ai we have a shared knowledge base — 1,765 markdown documents. Architectural decisions, team manifests, domain axioms, mission specs. The kind of curated corpus I've been advocating for: write down the physics, give the agent a map, and grep does the rest.

Except grep stopped doing the rest about 800 documents ago.

It's worth understanding what Claude Code's search actually is. Reverse engineering of the system prompt shows what's under the hood: a GrepTool (regex over file contents), a GlobTool (find files by name pattern), and a View tool (read a file). When you ask "how does authentication work," the model extracts keywords from your request — auth, token, login, middleware — and issues grep calls for those literal strings. The model's reasoning decides which keywords to try, but the search itself is just regex over text. No semantic understanding, no index, no ranking. It either matches the string or it doesn't.

Claude Code does have a dispatch_agent that spins up a subagent for search — "when you're not confident you'll find the right match in the first few tries, use the Agent tool." That helps with context pollution (the subagent searches in isolation, only returns what's relevant). But the subagent still only has grep, glob, and read. Better isolation, same primitives.

For known identifiers — function names, class names, import paths — this is fine. For conceptual queries, it falls apart. "How does the billing system handle failed payments" has no single string to grep for. The logic might span three files connected by imports, not shared keywords. And Morph's research puts a number on it: agents spend 60%+ of their time searching, and untrained models take 12+ turns to find what a trained search model finds in 4.

So I built a "context engine" for fun — I'd used LanceDB on a toy project embedding images with CLIP and it seemed like a reasonable experiment to try for text. A local RAG tool that indexes markdown into LanceDB, embeds via Ollama's nomic-embed-text, and serves hybrid search (vector + BM25 with Reciprocal Rank Fusion) over MCP. Point it at your folders, index, add the MCP server to Claude Code — done in five minutes. We added a plugin installer, a /ce-search skill, config-as-code. Clean, lightweight, purpose-built for our org.

Meanwhile, without knowing about my solution, a coworker had set up QMD — Tobi Lütke's local-first semantic search engine — on the same knowledge base. Same problem, different approach.

Then he ran them head-to-head on identical queries. Three queries told the whole story.

#The Benchmark That Changed My Mind

Query 1: "How does [internal service] render and serve pages?"

Our context engine returned tangentially related docs at 67% confidence — files that mentioned the service but weren't about its rendering pipeline.
QMD returned the team lead's manifest at 93% — the person who owns that architecture. Then the rollout plan, the core engineering derivation. It found the people and decisions that own the answer, not just files that contain the keyword.

Query 2: "How does [data collection system] work and what events does it track?"

Both found the right document — the system's architecture spec. But our context engine returned four chunks from the same doc. QMD returned breadth: the spec, a related analytics mission, the domain strategy document. Connected context from across the corpus.

Query 3: "NestJS authentication JWT token refresh flow"

grep -ri "JWT.*refresh\|refresh.*token" → 0 results. The exact phrase doesn't exist in our knowledge base.
Our context engine found the security architecture doc at 69%, then returned it again as a duplicate chunk.
QMD found the same doc at 88%, then the engineer who owns auth at 50%. It found the right system and the right person from a query that grep couldn't touch.

The gap is reranking. Our context engine stops at RRF fusion — statistical merging of vector and keyword results. QMD adds two LLM stages: query expansion (a 1.7B model generates search variants) and neural reranking (a 0.6B cross-encoder scores each result for actual relevance). RRF is statistical. The reranker understands the query.

Fewer, smarter chunks matter too. QMD uses boundary scoring — it targets ~900 tokens per chunk, scores potential break points by heading weight, code block boundaries, blank lines — and picks the best split within range. Our context engine splits on headings with a hard 2,000-char limit. Same 1,765 docs, but QMD produced 49,799 chunks vs. my 86,432. Fewer chunks, less noise, better retrieval.

Our context engine wins on incremental reindex speed (1.1s vs. 5.8s for single file changes), lighter model footprint (274MB Ollama vs. 2.3GB on-device GGUF), and org integration (plugin installer, skills, config-as-code). But I'll be honest: QMD kind of replaces what I built. Retrieval quality is the thing that matters for agents, and QMD is meaningfully better there. The differences are interesting though — reranking vs. RRF, boundary-scored chunking vs. heading splits, 49K chunks vs. 86K from the same docs. Same problem, and the architectural choices that diverged tell you a lot about where retrieval quality actually comes from.

#The Landscape Forming Around This

The shared-knowledge benchmark was a microcosm of what's happening at the tooling level. A stack is forming, and it has three layers:

The crash course — orientation, conventions, constraints. This is the CLAUDE.md. In practice it's part onboarding doc ("this is a TypeScript monorepo, here's how to run tests"), part house rules ("we use Drizzle for ORM, errors follow this pattern"), part map ("the auth system lives in src/lib/auth"), and part hard constraint ("never modify migrations directly"). It's the agent's first day at the company, compressed into a file. Zero latency, zero cost, survives context compaction. And it encodes judgment and taste — things no retrieval system can find because they never existed as documents until someone wrote them down. The three-layer model applies to the constraint part specifically: when those rules are tight, the agent needs less search because the physics are already loaded.

Anthropic added .claude/rules/ — modular rule files that split the monolithic CLAUDE.md into focused pieces (code-style.md, testing.md, security.md). Personal rules in ~/.claude/rules/ apply to every project on your machine; project rules override per-repo. Symlinks let you share common rules across projects. It's the same constraint layer getting more structured — from one big doc to composable, hierarchical modules.

But all of it — CLAUDE.md, rules, manifests — only captures what you knew to write down. It's blind to the file you forgot, the dependency you didn't know existed, the person who owns the answer.

Semantic search — discovery. Cursor's research is the clearest evidence: 12.5% higher accuracy across all frontier models when semantic search is available, with code retention improving 2.6% on large codebases. They trained a custom embedding model on agent traces — analyzing what should have been retrieved earlier in a session, then training the model to surface it sooner. The search learns from how agents actually work.

The tools: QMD for local-first hybrid search with reranking. Nia (YC S25) for hosted indexing of remote codebases and third-party packages — and for context sharing across agents (plan in Cursor, continue in Claude Code, search context comes with you). Augment's Context Engine for the most ambitious approach: 1M+ files, real-time knowledge graph mapping architecture and dependencies, exposed via MCP.

The limitation is structural. Embeddings treat code as flat text. They can't follow a function call from handler.ts to utils/auth.ts to lib/jwt.ts. Google DeepMind proved there's a mathematical ceiling on what embeddings can represent at scale.

Code intelligence — structural indexing. There's a layer between text retrieval and agentic search that deserves its own mention: LSP, the Language Server Protocol. Microsoft built it for VS Code in 2016 to standardize how editors understand code — go-to-definition, find-references, hover for type info, symbol outlines. It's how your IDE knows that validateToken in auth.ts is called from middleware.ts and returns a Promise<User>. Structural understanding, not text matching.

Now agents are getting it too. Claude Code added native LSP tools in December 2025 — go-to-definition, find-references, document symbols. Serena is an MCP server that exposes the same capabilities to any agent. There's a whole plugin marketplace with LSP servers for TypeScript, Rust, Python, Go, Java, and twenty other languages.

This is the code equivalent of what QMD and semantic search do for documents. QMD indexes text — finds the right document by meaning. LSP indexes code structure — follows the actual dependency graph, type system, and call hierarchy. One finds where concepts are discussed. The other follows where functions are called. An agent with both can find the auth documentation and trace the actual token validation path through the codebase.

It's also the most purely symbolic layer in the whole stack — ASTs, type systems, and call graphs are formal structures, not learned approximations. Which makes the neuro-symbolic connection even more direct.

Agentic search — comprehension. This is the new category. The agent searches, reads results, reasons about what it found, searches again. Multi-turn, hypothesis-driven. "Grep for webhook, find the handler in api/webhooks/, read it, see it calls decodeJWT from lib/crypto, follow that import, find the actual validation logic."

The critical insight from Anthropic's engineering: agentic search should run in a subagent with its own context window. When the main coding model searches, every dead-end file stays in context. After five or six exploration turns, the context is polluted. Performance degrades 30%+. Subagent architecture solves this — the search agent explores in isolation, throws away dead ends, returns only the relevant spans. The main model's context stays clean.

This is why Anthropic's multi-agent approach outperformed single-agent Opus by 90%. Not smarter agents — cleaner context.

Trained search models (Morph's WarpGrep) achieve the same retrieval quality in 3.8 steps vs. 12.4 for untrained models — 3x fewer turns, because they fire 8+ parallel tool calls per turn instead of exploring sequentially.

#The Stack

CONSTRAINTS (manual, distilled)   → the physics, always loaded
SEMANTIC SEARCH (indexed)         → discovery, what you didn't know to ask about
AGENTIC SEARCH (multi-turn)       → comprehension, following the thread

Constraints without discovery is blind. Discovery without comprehension is noisy. Comprehension without constraints is aimless.

The part I'm still working out: where does the constraint layer end and the search layer begin? Right now I'm forging axioms through a conscious, explicit process — running research across different foundation models, finding where they diverge, adversarially fusing the results into something I trust. "We use Drizzle for the ORM, Elysia for the API layer, here's the error handling pattern" — but those aren't just typed from memory. They're distilled from deliberate multi-model investigation. But what if the agent could derive those axioms from the codebase, the way Cursor's embedding model learns from agent traces? The constraints would be automatically distilled, continuously updated, and the manual curation I do would become the exception rather than the rule.

That's the version of this that makes my constraint-layer thesis both more powerful and slightly obsolete. The physics still matter. But maybe you don't have to write them by hand forever.

#The Missing Layer: Forgetting Well

There's a dimension I haven't mentioned yet: context management over time. Retrieval is half the problem. The other half is what happens when the context window fills up. I wrote about this in Memory and Journals — Borges's Funes, the man with perfect memory who couldn't think because he couldn't forget. Agents that log everything are agents that understand nothing. The ones that learn to forget well might be the ones that actually think. The same principle applies to retrieval: seeing everything isn't the goal. Seeing the right things is.

A recent teardown of how Codex, Claude Code, and OpenCode handle compaction found three completely different strategies. Codex writes a "handoff summary" — distill everything into a briefing for the next context window, delete the rest. Claude Code uses three-tier progressive forgetting — first trim old tool results (zero LLM cost), then use prompt cache strategies, then a structured 9-section LLM summary as a last resort. OpenCode does timestamp-based message hiding with a 5-heading summary.

The insight: "the best context management isn't about endlessly expanding memory capacity, but learning to forget with precision." That's a compaction version of the same point. It's not about seeing everything. It's about seeing the right things.

This connects to why even crude approaches have value as stopgaps. We have an auto-generated INDEX.md in our shared-knowledge repo — basically an index in a book, a flat listing that gives agents a map before they search. It's honest-to-god crude, and probably obviated by QMD or similar once you have proper retrieval set up. But it tells the agent where to start looking and that reduces wasted search turns, which is better than nothing when you're at 1,765 docs and haven't indexed yet. There's now a codebase-context-spec proposal to standardize this pattern — a .context/index.md at the root of any project — and an empirical study from November 2025 analyzing how developers are actually writing these manifests in practice.

Even Martin Fowler's team is calling file reading and searching "the most basic and powerful context interfaces in coding agents." The industry is converging on the idea that context is the engineering surface.

#The Neuro-Symbolic Thing Nobody's Naming

Here's what I keep circling back to. The constraint layer — CLAUDE.md, axioms, manifests, INDEX.md, the codebase-context-spec proposal — is symbolic. Structured, human-authored, explicit rules. The physics you write down.

Semantic search — embeddings, vector similarity, reranking models — is neural. Learned, fuzzy, pattern-matching. The discovery that finds things you didn't know to ask about.

Agentic search is the hybrid. Neural reasoning over symbolic structures — following imports, reading code, tracing call graphs. An LLM using grep and file-read tools to navigate a codebase is a neural system reasoning about symbolic artifacts. It's the neuro-symbolic part happening in practice without anyone calling it that.

The whole context engineering stack is arguably a neuro-symbolic architecture emerging bottom-up. Nobody designed it as one. But when you layer explicit constraints (symbolic) on top of learned retrieval (neural) on top of reasoning-driven search (hybrid), you've reinvented something AI researchers have been theorizing about for decades — just pragmatically, in the IDE, without the academic framing.

The neuro-symbolic debate in AI has always been about this: pure neural systems (transformers, embeddings) are powerful but opaque and brittle on edge cases. Pure symbolic systems (rule engines, knowledge graphs, formal logic) are precise but can't handle ambiguity or scale to natural language. The whole field has been trying to combine them. And here we are, doing it accidentally, because agents need both grep and semantic search and human-written axioms to function well in a real codebase.

The Memory and Journals connection matters here too. Forgetting is a compression operation — it's the system deciding what to keep in symbolic form (the distilled summary, the axiom, the journal entry) versus what stays in the neural substrate (the raw embeddings, the full conversation history that can be retrieved but doesn't need to be present). Human memory does this automatically. Agent memory systems are reinventing it piece by piece — compaction, tool-result trimming, context windows with sliding summaries — without a unified theory of what they're doing.

Maybe that's fine. Maybe the unified theory isn't necessary and the pragmatic layering is the theory. But it's worth noticing that the "how do I get my coding agent to understand my codebase" problem and the "how do we combine symbolic and neural AI" problem are the same problem wearing different clothes.

#No One Size Fits All

The honest answer to "how do you make effective use of context" isn't a stack diagram. It's: we're running four different approaches simultaneously and still figuring out which one to reach for when.

Our shared repo's INDEX.md gives agents a generated map before they start searching — high-level orientation
Our internal "context engine" gives hybrid RAG search over the corpus — find the right document
QMD adds reranking and smarter chunking — find the right answer within the right document
Claude Code's native grep/glob/subagent pattern follows causal chains across files — understand how things connect

They overlap. They have different failure modes. Grep is instant but dumb. Semantic search is smart but stale. Agentic search is thorough but slow. The manifest is high-signal but manual. None of them alone is sufficient. The combination is the practice.

The three-layer model from Code Owns Truth still holds as a frame — constraints bound the space, prompts express intent, code is truth — but when I zoom into the constraint layer itself, it's not one thing. It's a stack of different retrieval strategies, each with tradeoffs, layered on top of each other and evolving fast.

#The Ground Shifting Underneath

Everything I've described so far is scaffolding built around a constraint: dense attention is quadratically expensive, so we chunk, retrieve, compress, and orchestrate to keep the right context visible without blowing up the cost.

And every layer of that scaffolding has failure modes. RAG preserves semantic similarity but loses position, hierarchy, and reference structure — a chunk may contain the right text while losing why that text matters. Agentic search is thorough but slow, and dead-end explorations pollute context. Compaction preserves gist but drops the specific constraint that governed a later decision. The manifest encodes human judgment but only captures what someone knew to write down. We accept these tradeoffs because the alternative — putting everything in context — has been prohibitively expensive.

What happens when it isn't?

Subquadratic launched this week with their SSA architecture — Subquadratic Sparse Attention — and the claimed numbers are striking. 52× prefill speedup over dense attention at 1M tokens. A 12M token context window. Linear scaling, not quadratic. The architecture doesn't approximate attention or compress state into a fixed-size buffer. It does content-dependent selection: for each query, the model decides which positions in the sequence are worth attending to, computes exact attention over those, and skips the rest.

The distinction from prior attempts matters. Sliding windows and fixed-pattern sparsity gave up content-dependent routing. State space models gave up exact retrieval from arbitrary positions. Hybrids reintroduced dense layers and with them the original cost. SSA's claim is that it doesn't make that trade: linear scaling with content-dependent routing and arbitrary-position retrieval. On MRCR v2 at 1M tokens — the hardest long-context retrieval benchmark, requiring multi-hop evidence integration — SubQ scores 65.9%. That's behind Opus 4.6 (78.3%) and GPT 5.5 (74.0%), but well ahead of GPT 5.4 (36.6%), Opus 4.7 (32.2%), and Gemini 3.1 Pro (26.3%). Not frontier retrieval, but genuinely functional retrieval at a fraction of the compute — and at context lengths where dense attention models either can't run economically or can't actually use the context they accept.

Caveats are real. The full technical report hasn't been released. Weights aren't open. The benchmark selection is narrow — three tests, all emphasizing the long-context retrieval and coding tasks SSA is designed for. There's also a 17-point gap between SubQ's research MRCR score (83) and its third-party verified production score (65.9) that's largely unexplained. The AI research community's reaction has ranged from "genuine breakthrough" to "wait for the technical report." History favors skepticism — Mamba, RWKV, and every prior subquadratic architecture looked promising in papers and underperformed transformers at frontier scale. SSA may be different. It may not be. The honest position is: the direction is clearly right even if this specific implementation needs more proof.

But the direction is what matters for this essay. Every layer of the stack I've been describing — the manifest, the semantic search, the agentic subagent exploration, the compaction strategies — exists partly because putting everything in context was too expensive, and partly because retrieval scaffolding introduces its own failure modes that we've learned to live with. If attention becomes cheap at million-token scale, both pressures change. The economic argument for chunking weakens. And the failure modes of retrieval — lost hierarchy, semantic drift, fragmented reasoning — become avoidable rather than tolerable.

Not entirely, though. Even if you can attend to 12M tokens subquadratically, should you? The Funes argument from Memory and Journals holds on cognitive grounds even when cost goes to zero. Total recall without compression still produces noise. The constraint layer encoding human judgment — the CLAUDE.md, the axioms, the manifests — still matters, because it's not about what the model can see, it's about what the model should prioritize. And forgetting-as-meaning is a cognition problem, not a cost problem.

But the engineering argument — "we chunk because we have to" — gets weaker. And that means the scaffolding layer shifts from load-bearing infrastructure to optional optimization. The manifest still matters. Semantic search still helps. But they become choices rather than necessities, and the failure modes we've been routing around become the strongest argument for keeping them.

The stack doesn't collapse. But the reason each layer exists changes — from "we can't afford to see everything" to "we shouldn't want to see everything, but now we're choosing rather than being forced."

The differentiator was never the model. It was always what the model could see. But what the model should see depends on the question, the scale, and the moment — and no single tool gets that right yet. The architecture layer is moving fast enough that some of those tools may become optional before they become mature. That's not a reason to stop building them. It's a reason to hold them loosely.