The Loop You Were Already In

This comes from twelve years of building production systems at Product.ai and six months of running multi-agent coding workflows daily — where the shift from 'I prompt the agent' to 'I design the system that prompts the agent' happened gradually and then all at once.

This month I watched a workflow spawn an entire team — thirty subagents, each on the latest Opus — to fix a single type error. Every one of them re-acquired context the orchestrator already held, then explored independently with zero knowledge of what the others had tried. It exhausted my Claude account quota for the rest of the day. The fix was a one-line import. I killed it and wrote a constraint: if the orchestrator already holds the context and the work looks mechanical, do it inline — don’t spawn.

That constraint is still in the file. It's one of eighteen now. Each one exists because something failed in a way I didn't anticipate, and I decided the system should never fail that way again.

When Peter Steinberger posted two sentences last week, the vocabulary clicked:

"...you shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

Boris Cherny, head of Claude Code at Anthropic, said the same thing on stage a few days earlier: "I don't prompt Claude anymore. I have loops running. They're the ones prompting Claude and figuring out what to do. My job is to write loops."

Addy Osmani formalized it. A dozen Substacks published guides. The SEO farms had their "What Is Loop Engineering?" posts up before Friday. Reddit replied: "It's just a while loop." Geoffrey Huntley's ralph proved the bare while-loop worked in 2025; Claude Code's /goal is that, productized.

They're not wrong. But they're not as right as they think they are.

#The Shift I Didn't Name

Here's what my workflow actually looks like now. I spot a problem — a failing test, a performance regression, an architectural pattern that's drifting. I don't open a terminal and start prompting the coding agent. I write a spec: what the goal is, what the constraints are, what "done" looks like in testable terms. Sometimes I write the spec myself. More often I think it through with an AI as a thought partner — I'm still prompting, but at the planning layer, not the execution layer. Then I hand that spec to an orchestrator who diagnoses the approach, picks the right tools, and dispatches the work to a coding agent running in a sandboxed environment. The coding agent writes code, runs tests, reads the errors, fixes, runs again. When it finishes, the result comes back up the chain. I review, approve or redirect, and move on.

At no point did I type "please fix the auth middleware." The prompting didn't disappear — it moved up the stack. From "write this code" to "help me think about what this should be" to a structured spec that the system executes inside constraints. My job was designing those constraints and making the judgment calls that shaped the spec.

Same pattern for our knowledge pipeline — I wrote validation specs and quality thresholds instead of research prompts, the forge runner dispatched across providers overnight, and I reviewed clean output over coffee. Different domain, same shift.

I'd been running this pattern for months without a name. And honestly, it is just a while loop. But naming it sharpened how I thought about what I was already doing. The label made the structure visible.

The three-layer model I proposed earlier this year maps directly onto why this works:

CONSTRAINTS → bounds the mutation space
PROMPTS → expresses intent, inherently fuzzy
CODE → source of truth, deterministic, testable

A loop is these three layers cycling autonomously. The constraint layer (rules, permissions, stop conditions, protected paths) bounds what the agent can do. The prompt layer (the goal definition, the task spec) tells it what to try. The code layer (tests, linters, build checks) decides whether it actually worked. Constraints bound variance. Prompts express intent. Code owns truth. The loop repeats until truth is satisfied or the constraints say stop.

#The Tick

Every loop, at every frequency, runs the same three-move cycle:

Act. The agent does something: writes code, researches a market, drafts a strategy, triages a backlog, scans for patterns. What it can do depends on its tools (filesystem, shell, browser, APIs; a loop without tools is a model guessing in a circle) and what it can see (the context it retrieves, the memory it carries, the constraints that bound it). I wrote about this in The Context Engineering Stack: the retrieval quality at each step determines the loop's ceiling. And forgetting well matters as much as retrieving well — Funes the Memorious had perfect memory and couldn't think.

Verify. Something other than the agent checks whether the action worked. Tests, linter, build success, a separate evaluator model, a human reviewing a PR, production metrics trending the right direction — the signal ranges from binary to fuzzy, from seconds to weeks. But it has to exist. Without verification, the agent has no way to know if it succeeded, and the loop spins instead of tightening.

Decide. Based on verification, the loop branches:

Converged — goal met. Stop.
Recoverable error — bad syntax, failing test with a clear signal. Iterate: go back to Act with a different approach.
Hard blocker — missing credentials, ambiguity that requires judgment. Escalate to a human or an outer loop.
Budget exceeded — max iterations, token ceiling, cost limit. Halt, even if unfinished.
No progress — same error on repeat, no variation. A loop that retries the exact same action after the exact same error isn't iterating; it's stuck. If a human engineer spent four hours on one error without escalating, you'd call it a performance issue. Loops need the same instinct.

That's the anatomy. Act → Verify → Decide, repeat until done or halted. The ReAct pattern from the Princeton/Google research, now with real tooling underneath it.
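The five branches above can be sketched as a small control loop. This is a minimal sketch, not a reference implementation: `run_loop`, `Check`, `Outcome`, and the `max_repeats` threshold are hypothetical names introduced here, and `act` and `verify` stand in for whatever agent call and checker you actually use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    CONVERGED = "converged"
    ESCALATED = "escalated"
    BUDGET_EXCEEDED = "budget_exceeded"
    NO_PROGRESS = "no_progress"


@dataclass
class Check:
    ok: bool               # did verification pass?
    error: str = ""        # failure signal fed back into the next Act
    blocker: bool = False  # True when only a human or outer loop can resolve it


def run_loop(
    act: Callable[[Optional[str]], str],
    verify: Callable[[str], Check],
    max_iterations: int = 5,
    max_repeats: int = 2,
) -> tuple[Outcome, int]:
    """Run Act -> Verify -> Decide until one of the stop branches fires."""
    last_error: Optional[str] = None
    repeats = 0
    for iteration in range(1, max_iterations + 1):
        result = act(last_error)   # Act: the agent sees the previous error
        check = verify(result)     # Verify: something other than the agent
        if check.ok:
            return Outcome.CONVERGED, iteration
        if check.blocker:
            return Outcome.ESCALATED, iteration
        # No progress: the same error appearing again counts as a repeat.
        repeats = repeats + 1 if check.error == last_error else 0
        if repeats >= max_repeats:
            return Outcome.NO_PROGRESS, iteration
        last_error = check.error   # Recoverable: iterate with the new signal
    return Outcome.BUDGET_EXCEEDED, max_iterations
```

The budget here is only an iteration count. A token or cost ceiling would slot into the same place as one more early return.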

What changes across loop frequencies isn't the cycle — it's what each move contains. A tight loop's Act is "write a function." A medium loop's Act is "triage the backlog." A long loop's Act is "reassess the strategy." Verification ranges from binary (tests pass) to lagging and noisy (six weeks of retention data). Decide ranges from "retry with a different approach" to "change the roadmap." Same tick. Different contents. Different stakes.

But don't let the fix-and-repair examples mislead you about what goals can drive the tick. The most powerful loops are generative: implement a feature from a job-to-be-done, build an onboarding flow from a spec against acceptance criteria, analyze a dataset and return a cited verdict. A roadmap item with clear exit conditions is a loop goal. A user story with testable outcomes is a loop goal. The pattern works anywhere the definition of done can be made concrete enough for something to check, and that's a much larger surface than passing tests.

#What It's Not

It's not full autonomy. A loop without budgets and escalation paths is an expensive accident. Token costs scale with iteration count, and the choice between "token-rich" loops (full context every turn) and "token-poor" loops (compressed summaries) is already an active design decision with real cost tradeoffs. You're still designing the track.

It's not a replacement for prompt engineering. Prompts are still inside the loop — they become subagent definitions, skill specs, task descriptions. But the leverage moved. The prompt is one component; the context you engineer around it is what determines whether the loop converges or spins.

It's not new computer science. Reddit's right: structurally, it is a while loop. The tooling is what changed — it makes the pattern practical without custom bash scripts and duct tape.

And it's not free. A loop without economic circuit breakers isn't converging — it's an infinite loop with a credit card attached.

And it's not a long agent session. An agent running for four hours with dynamic workflows and parallel subagents is impressive, but duration and complexity don't make it a loop. A loop recurs: each cycle's output feeds the next cycle's input. What drives the recurrence varies (a cron schedule, a CI trigger, a webhook, the human operator opening a new session tomorrow morning), but the next cycle exists and inherits from the last one. It verifies automatically: something other than the agent checks whether the cycle converged. It persists state: what happened last time is available next time. It terminates or escalates on its own: knows when to stop, when to ask for help, when the goal is met. Without recurrence and automated feedback closing the signal, you have a powerful session, not a loop. The compounding value (the ratchet, the constraint accumulation) comes from iteration across cycles, not computation within one.

#Loops Nest

Everyone's talking about the tight loop. Agent acts, verifies, adjusts, repeats. In engineering that's write code, run tests, read error, fix. In research it's gather sources, evaluate against a rubric, fill gaps. In strategy it's draft a plan, run adversarial review, revise. Minutes per iteration. Fast feedback, narrow scope, clear exit condition. Claude Code's /goal "all tests pass" is one version — but the pattern isn't specific to code.

But that's one frequency. Loops nest. And the nesting is the architecture.

The tick — Act → Verify → Decide — is the same at every frequency. What changes is the primitives each frequency needs around it: the connective tissue between ticks, between loops, and between the system and the humans who own it.

Tight loops iterate in minutes toward a defined goal. The function is convergence. The primitives are atomic: a task (spec carrying goal, context, and exit condition), tools (filesystem, shell, APIs), verification (a separate evaluator checking each turn so the agent isn't grading its own homework — Claude Code ships this as /goal), and isolation (worktrees, sandboxes, branches for parallel work).

Isolation prevents write collisions, but the harder problem is coordination. Last sprint, one agent refactored our promotion model interface while another built a feature depending on the old shape. Both passed their tests. The merge failed. A merge failure from parallel agents isn't a bug in either agent — it's a missing coordination primitive, and the honest answer is it's still mostly unsolved. The workarounds are blunt: serialize work that touches shared interfaces, or accept that the outer loop's job includes resolving conflicts between its children.
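The first workaround, serializing work that touches shared interfaces, can be sketched as a greedy batching pass. This is an illustration under strong assumptions: `plan_batches` is a hypothetical helper, the task names and paths are invented, and each task's path set has to include files it depends on, not only files it writes. It does not solve the coordination problem; it only keeps overlapping tasks out of the same parallel batch.

```python
def plan_batches(tasks: dict[str, set[str]]) -> list[list[str]]:
    """Greedily group tasks so no two tasks in a batch touch the same path.

    Batches run one after another; tasks inside a batch can run in parallel.
    """
    batches: list[tuple[list[str], set[str]]] = []
    for name, paths in tasks.items():
        for members, claimed in batches:
            if claimed.isdisjoint(paths):
                members.append(name)
                claimed.update(paths)  # this batch now owns these paths
                break
        else:
            # Every existing batch overlaps, so this task waits for a later batch.
            batches.append(([name], set(paths)))
    return [members for members, _ in batches]


# Hypothetical tasks: the feature reads the model file the refactor rewrites.
tasks = {
    "refactor-promotion-model": {"models/promotion.py"},
    "promo-banner-feature": {"models/promotion.py", "ui/banner.py"},
    "docs-cleanup": {"README.md"},
}
print(plan_batches(tasks))
```

With these inputs the refactor and the docs cleanup share the first batch, and the feature that depends on the promotion model waits for the second.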

The medium loop is where the unsolved problems live. Hours to days. The function is adaptive planning: scanning for work, triaging, assigning to agents, reviewing what comes back, re-evaluating the plan against what was actually built each cycle. A PR might pass every test and still have bad architecture. An issue might be technically solvable and strategically wrong to solve right now. At Product.ai, our overnight ontology research fans out across five providers and I review clean output over coffee, but the decision about which categories to prioritize next still gets reshaped after each run.

The medium loop introduces primitives the tight loop didn't need:

Cadence: when and why the loop runs again. A tight loop's cadence is continuous; the medium loop's is scheduled or event-driven. A daily triage cron. A CI trigger. A webhook that fires when a PR lands. The skeptic's objection ("cronjobs have funny rebranding") is half right: the scheduling layer is cron. What cron never had is the decision logic in the body. A cron job runs a fixed script. A loop-on-cron runs a model that observes current state, decides what to do, acts, and checks whether it worked. Claude Code's /loop is session-scoped cadence (watch-style, re-fires on an interval while you're in the conversation). For scheduling that runs independently (laptop closed, across sessions), there are Desktop scheduled tasks, Cloud Routines, GitHub Actions, plain cron: anything that can trigger an agent run without a human in the chair.

Artifact: the output that makes the loop's state legible to the human at the gate. An HTML report with color-coded severity. A rendered diff with inline annotations. Thariq Shihipar at Anthropic made the case that HTML has replaced Markdown as the right output format for agent work — because when you're reading agent output, not editing it, you want spatial layout, embedded diagrams, interactive controls. A 200-line Markdown dump is where the human loses the thread.

Memory: what persists between ticks and across sessions. AGENTS.md, state files, markdown logs, progress trackers, the repo itself. Osmani's line: "The agent forgets, the repo doesn't." The ratchet lives here — the accumulated constraints that encode every past failure.

Escalation: the handoff mechanism. PR reviews, triage inboxes, Slack notifications, the ping that says "I'm stuck, please look." How the inner loop hands control to the outer loop. Without escalation, a stuck agent burns tokens instead of asking for help.
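Memory and escalation fit together in a few lines. The sketch below assumes a JSON state file and a hypothetical `run_cycle` function that some scheduler (cron, a CI trigger, a webhook) calls once per cycle. The field names and the three-cycle escalation threshold are illustrative choices, not a standard.

```python
import json
from pathlib import Path


def load_state(path: Path) -> dict:
    """Memory: read what earlier cycles recorded, or start empty."""
    if path.exists():
        return json.loads(path.read_text())
    return {"cycle": 0, "stuck_counts": {}, "escalated": []}


def run_cycle(path: Path, failing: list[str], escalate_after: int = 3) -> dict:
    """One cycle: inherit state, age the open items, escalate the stale ones."""
    state = load_state(path)
    state["cycle"] += 1
    counts = state["stuck_counts"]
    # Items that stopped failing drop out; items still failing age by one cycle.
    state["stuck_counts"] = {item: counts.get(item, 0) + 1 for item in failing}
    for item, age in state["stuck_counts"].items():
        if age >= escalate_after and item not in state["escalated"]:
            state["escalated"].append(item)  # Escalation: hand off to a human
    path.write_text(json.dumps(state, indent=2))
    return state
```

Each call inherits from the previous one only through the file, so the same code works whether the next cycle starts a minute later or the next morning.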

The medium loop's verification goes beyond PR review into production sensing. Error tracking (Sentry, Datadog) surfacing regression spikes after deploys. APM data showing latency percentiles creeping. Canary deploys and feature flags giving you a structured before/after signal on real traffic. Beta users catching the things test suites can't: the flow that's technically correct but feels wrong, the edge case that only appears with real data. This data already exists in most production systems; the underexplored move is feeding it into the loop as structured signal. Sentry's already doing it. Their Autofix agent (Seer) ingests error events, correlates them with your codebase, and can open a PR — but the interesting part isn't the fix. It's how they solve the budget problem: a fixability score. Before any expensive inference runs, a cheap ML model scores every issue 0.0–1.0 on how likely it is to be automatically fixable. Below the threshold, you get root cause analysis and nothing more. Above it, the system plans a fix, generates code, opens a PR. The response scales with confidence, not with error volume. That's the pattern the medium loop needs: don't commit Opus-tier compute to every signal. Score it first, tier the response, let the constraint layer decide how far automation goes.
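The score-then-tier pattern is easy to sketch. This is not Sentry's implementation: `tier_response`, the step names, and the 0.5 default threshold are assumptions for illustration. The sketch also assumes root cause analysis runs in both tiers.

```python
def tier_response(score: float, threshold: float = 0.5) -> list[str]:
    """Pick the steps to run for an issue from a cheap fixability score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    steps = ["root_cause_analysis"]  # cheap tier: always run
    if score >= threshold:
        # Expensive tier: only reached when the cheap model is confident.
        steps += ["plan_fix", "generate_code", "open_pr"]
    return steps


print(tier_response(0.2))  # low confidence: analysis only
print(tier_response(0.9))  # high confidence: full fix pipeline
```

The threshold is the constraint-layer knob: raising it trades automated fixes for lower spend without touching the agent itself.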

Long loops run over weeks or months. Sensing, evaluating, orienting. Which patterns keep failing? Which architectural decisions are calcifying? Are the constraints still right or have they become cargo cult?

I'm calling all three frequencies loops, but by my own definition in "What It's Not" (recurrence, automated verification, persistent state, self-termination), only the tight loop and a couple of narrow medium loops fully qualify. The long loop is one I want to exist more than one that does. No test runner evaluates whether your product direction is correct. A retention metric tells you in six weeks if the decision was right, and even then you're guessing at causality. The cadence is mostly human-initiated: quarterly reviews, post-mortems, the moment when metrics start telling a story you didn't expect. No mechanism yet for the strategic layer — just attention.

The long loop adds two final primitives. Constraint (rules files, hooks, permission boundaries, protected paths) bounds what the loop can do wrong. I spent two weeks writing nothing but axioms and constraint specs before letting an agent run unsupervised; that investment is still paying returns. Observability (logs, traces, cost tracking) lets you diagnose failures in the loop system itself. When a tight loop burns through its budget on a single error, the failure isn't in the agent's code; it's in the loop's termination logic. A system you can't debug is a system you can't trust past toy scale.

The relationship between loop frequency and human involvement isn't incidental. It's structural:

[Figure: The loop fractal. Tight loops nested inside medium loops inside long loops, with constraints propagating inward and human judgment increasing outward.]

Long loop (weeks/months): "Orient — what should we build and why?"
  └─ Medium loop (hours/days): "Replan — what's next, given what we just learned?"
      └─ Tight loop (minutes/hours): "Converge — write, test, fix, repeat"

The tighter the loop, the more verification is binary and automatable. The longer the loop, the more verification depends on causal interpretation: judgment, context, experience, taste. The things models can't do yet and might not do soon. Human judgment doesn't leave the system when you adopt loop engineering. It moves up the stack. You stop being the person who reads each error message and start being the person who decides what "done" means, what's worth building, when the constraints need updating, and when the loops should escalate back to you.

Osmani's canonical framework names five blocks plus state: automations, worktrees, skills, plugins/connectors, sub-agents, and state. Mapped to the frequency model: automations are cadence, worktrees are isolation, plugins/connectors are tools, sub-agents are the maker/checker pair that makes verification real, skills map to task plus memory, state maps to memory. What the frequency model adds — verification as a first-class primitive, artifact, escalation, constraint, observability — names the connective tissue between those blocks. Individual primitives are shipping in products (/goal is verification, /loop is cadence, worktrees are isolation), but no product ships the composed system: the recurrence across sessions, the replanning, the upward signal flow. That's still yours to build.

The fractal — tight loops inside medium loops inside long loops, each with different primitives and different levels of human involvement — is the architecture I'm reaching for. The inner loops automate. The outer loops are where strategy and taste live. The flow between them (constraints propagating down, signals bubbling up) is the design problem I'm circling. The down-flow works: constraints propagate through specs, hooks, rules files. The up-flow is where the real signal gap lives. Production observability (error rates, latency trends, deploy health) is structured and machine-readable, but mostly unused as loop input. Business metrics (retention, revenue, NPS) exist but are noisy and slow. The medium loop could ingest Sentry and APM data today; almost nobody wires it up. The long loop's signals are still mostly sensed by humans: I notice a pattern, I change a plan. But the production layer is low-hanging fruit: the data is already there, already structured, already flowing. The loop just isn't listening to it.

"The long loop is human territory" is too simple. The more clearly you've articulated your strategy, product vision, engineering philosophy, and business context, the more the system can do with it. A loop that has access to your product roadmap, your architectural axioms, your definition of what matters — that loop can flag things that are underscoped. A roadmap item without a defined plan. A feature without acceptance criteria. A goal that contradicts an existing constraint.

That flagging is the real human-in-the-loop moment: the system saying "this item doesn't have enough definition for me to act on — here's what's missing." You go in, pressure-test the plan, refine the scope, make the judgment calls. Then it flows back down: the scoped plan becomes a medium-loop task, which spawns tight loops, which execute and verify. The system attempts to constrain itself and answer its own questions based on the context you've built — your product, your business, your engineering philosophy, all encoded as constraints.

The implication: the richness of your constraint layer determines how far down the stack the system can operate without escalating. Thin constraints mean constant escalation: the agent keeps asking you questions. Rich constraints (a well-articulated strategy, clear architectural principles, codified product axioms) mean the agent can self-resolve more decisions and only escalate the ambiguous ones. The human's job in the long loop is building the context that makes fewer calls hard.

At Product.ai we've started treating the constraint layer as its own product. We write specs that define what must be true before any agent starts building, and we treat those specs as the primary artifact.

The spec is the work. The code is what the work produces.

That inversion changed how the whole team ships: strategy becomes systems without a human typing in the middle, because the constraints already encode what "good" means.

The other variable is the model underneath. Fable 5 was live for three days before the Commerce Department pulled it over a claimed jailbreak. In that window, the duration scaling showed up in the benchmarks: the performance gap over previous models grew with task duration, marginal on short work and significant on multi-hour sessions. That's exactly the regime where medium loops live. Tight loops were already reliable with current models. The question is whether that kind of extended coherence makes medium loops self-directing or just makes one-shots longer. As of mid-June, the model that would have answered that question is offline, and no public equivalent exists.

#The Stack

Four layers keep coming up in every conversation about agent tooling: prompt engineering (clear instructions), context engineering (what the agent sees — the differentiator was never the model; it was what the model could see), loop engineering (recurring, self-correcting cycles), and harness engineering (Viv Trivedy's term for safe execution: tool permissions, sandboxes, monitoring, human takeover, and the ratchet — every failure becomes a permanent constraint). Addy Osmani draws the relationship: the loop sits one floor above the environment. Prompts go inside context. Context goes inside loops. Loops run inside harnesses. Each answers a different question; getting one right doesn't exempt you from the others.

Most teams doing impressive work with coding agents right now, including mine, are closer to harness engineering with long autonomous sessions than to loop engineering. We split planning from implementation: a human-driven planning phase (grounding, spec, architectural decisions) followed by a long-running agent session that executes within constraints. The loop cycle exists (plan, execute, review, replan), but the cadence is human-led. I decide what to build next. I kick off the session. I review the output. I update the constraints. The exceptions prove the pattern: error triage, signal aggregation, dependency scanning — tasks with machine-verifiable exit conditions — run autonomously on a schedule. Those are actual loops. Everything else is a human driving the cycle with a well-built harness underneath. The gap between those two modes is where the interesting engineering problem lives.

#The Ratchet Is the Constraint Layer Tightening

The pattern that connects all of this to what I've been arguing:

The ratchet — Osmani's term for "every agent mistake becomes a permanent rule" — is the constraint layer getting tighter over time. The mutation space shrinks. The agent has less room to do the wrong thing. Quality goes up because the physics got stricter.

This is what "constraints bound variance" looks like in practice. I said months ago — in Code Owns Truth, in Design Physics, and again in the harness engineering field report — that the designer's job is Layer 1: engineering the constraints. The loop engineering discourse is arriving at the same conclusion from the operational side: the constraints determine the loop's ceiling. A well-tuned model with loose constraints still produces expensive chaos. Tighten the constraints and even a decent model gets reliable.

The interesting implication: the ratchet means the constraint layer is self-improving. Every loop iteration that fails teaches you something, and if you encode that something, the next iteration can't fail the same way. The system accumulates scar tissue. Over enough cycles, the constraints become a compressed record of every mistake the system ever made — which is another way of saying they become judgment, crystallized.

But a ratchet that only tightens eventually seizes. Reactive rules accumulate fast ("don't comment out tests," "block writes to migrations," "never delete a fixture file") and, left unchecked, they become a contradictory rulebook that paralyzes the agent or makes the constraint file itself a maintenance burden. The move is consolidation: periodically review the reactive rules and distill them into proactive constraints. Ten specific "don't do X" rules about test files collapse into one architectural principle about test ownership. Five rules about migration safety become a single hook that enforces the policy structurally rather than through prose. I'm staring at eighteen enforcement hooks in our constraint layer right now, and the consolidation pass is overdue. The reactive rules are how you learn what matters. The proactive constraints are how you encode that learning durably. This is a long-loop function: orientation applied to the constraint layer itself. Are these rules still right? Have any become cargo cult? Can three of them collapse into one that's enforced by tooling instead of instructions? The ratchet tightens, but it also needs to be re-forged periodically into something leaner.

But the ratchet only explains half of what's happening. Call it the constraint ratchet: failures become rules, rules prevent backsliding, the system can't fail the same way twice. Click, click, tighter. That's necessary but it's not why the system gets better.

The other half is the skill flywheel. Ten specific "don't do X" rules collapsing into one architectural principle isn't just consolidation — it's the system developing judgment. The constraint layer taught you what mattered; the skill is knowing that before you need the rule. Sentry's fixability score is the skill flywheel, not the constraint ratchet. Nobody wrote a rule that says "don't try to autofix issues with tangled cross-service dependencies." The model learned to score those low from data — millions of errors, which fixes landed, which didn't.

The constraint ratchet is the floor. The skill flywheel is the ceiling rising. One prevents the system from getting worse. The other makes it genuinely better at the work.

And they feed each other: failures become constraints, constraints produce better-scoped runs, better-scoped runs generate cleaner signal about what works, cleaner signal sharpens skills, sharper skills produce new kinds of failures at a higher level — which become new constraints. The constraint ratchet and the skill flywheel compound together. Your CLAUDE.md getting more precise is a constraint. A skill definition that encodes how to scope a task before committing compute — that's the flywheel. Both live in files. Both are the system learning. The difference is direction: constraints say don't do this again, skills say here's how to do this well.

That's the version of this that actually matters. Not "while loops with LLMs inside." The system that builds skill over time — where failures become constraints, constraints become patterns, and patterns become judgment encoded into the physics rather than exercised per turn. Where taste gets encoded into the environment itself, and the encoding stays clean enough to remain useful.

#Where to Start

If any of this resonates and you want to start thinking in loops:

Start with the environment, not the loop. The most common failure mode is automating a cycle before the foundation is ready. Get your constraint files solid. Document conventions, protected paths, test commands, known failure patterns. The prep work pays returns for months.

Pick one task with a verifiable exit condition. Not "lint is clean" — something with real scope. Implement a feature where the exit condition is acceptance criteria met, tests pass, no regressions. Analyze a dataset and return a verdict with cited evidence. Design a component from a JTBD spec and output a working mockup. Review a contract and flag risks against standard terms. The pattern works anywhere the definition of done can be checked — by a test suite, a structured rubric, or a separate evaluator model grading the output.

The harness includes how you structure the approach: for a feature, one loop writes the tests from the spec (TDD), the next loops implement against them, checking for regressions between cycles. For research, one loop gathers sources, the next synthesizes, a third verifies citations. The decomposition into loops is itself a design decision — and one of the first places the ratchet teaches you something. Watch it converge. Learn where it gets stuck. That's your first ratchet input.

Encode every failure. This is the most important practice in the whole discipline. Every time the agent fails in a way you didn't anticipate, write it down as a rule. The agent commented out a test? "Never comment out tests — delete them or fix them." The agent modified a migration file? Block writes to migrations/. The constraint file gets longer. The failures get fewer. This compounds.

Graduate to replanning. Once tight loops converge reliably, zoom out. Can the system find its own work? Scan for open issues, failing CI, stale branches? Can it triage — decide what's worth doing versus what's blocked versus what needs a human? More importantly: after each cycle, does the plan still make sense given what was actually built? The backlog reshapes itself each cycle. That's the medium loop forming.

Be honest about orientation. Strategy, product direction, "should we even build this" — that's not automatable today. Models can surface patterns, flag lagging indicators, summarize signals. But the judgment call is yours. The most mature practitioners I've seen are explicit about where their loops stop and human sensing begins.

You don't need a specific product for any of this. Aider, Pi, Cursor, even a bash script that runs tests, pipes failures to an LLM, applies the diff, and reruns — the primitives are universal. What matters is whether the loop has a feedback signal and the constraints to stay on track.

The loop was always there. Most of us were just running it manually, one prompt at a time, without noticing what we were doing. The shift isn't the concept. The shift is making it explicit — and then sitting with what the ratchet implies.

Every failure I encode shrinks the territory I just called mine. The constraints get smarter, the loops more reliable. The boundary between "the agent handles this" and "this requires me" moves inward, and it doesn't move back. I don't know if there's a floor — whether there's some irreducible core of judgment that can't be crystallized into a constraint file, or whether the ratchet just keeps tightening.

That's the question the next year of running these loops is going to answer for me, whether I like the answer or not.

Update, July 9, 2026: The model that would have answered the duration-scaling question above came back online over a week ago, and two more useful models shipped this week.

Fable 5 is back — restored July 1 after 18 days offline. The export controls followed a real finding: Amazon's threat intelligence team discovered a way to bypass Fable 5's cybersecurity safeguards, the Commerce Department pulled both Fable 5 and Mythos 5 on June 12 while Anthropic retrained the safety classifier, and the government lifted the controls once the new classifier blocked the exploit in over 99% of cases. Worth being precise about that rather than waving it off as pure bureaucracy — there was an actual gap, and it got closed. Either way, the question about whether extended coherence makes medium loops self-directing or just makes one-shots longer is answerable again, once someone runs the experiment.

The more useful development shipped this week. Grok 4.5 landed July 8 at $2 input / $6 output per million tokens — against Opus 4.8's $5/$25 or Fable 5's $10/$50 — and it's genuinely efficient: 4.2× fewer output tokens than Opus 4.8 on SWE-Bench Pro for comparable work. It's not frontier — it trails Opus on the harder benchmarks and Fable by a wide margin — but "roughly last year's flagship at 60% off" is exactly the shape of model a tight loop wants running its volume work. GPT-5.6 went broadly public the next day, July 9, after nearly two weeks gated to a couple dozen government-vetted partners — three SKUs of one generation, Sol, Terra, Luna, priced flagship down to fast-and-cheap, which means OpenAI productized the tiering decision instead of leaving it to whoever's assembling the loop.
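To make the price gap concrete, here is a small worked calculation using the per-million-token prices quoted above. The workload is hypothetical: the token counts are invented, the dictionary keys are plain labels (not API model identifiers), and it assumes the 4.2× output-token reduction measured on SWE-Bench Pro carries over unchanged, which it may not.

```python
# Prices in USD per million tokens (input, output), as quoted in the text above.
PRICES = {
    "grok-4.5": (2.0, 6.0),
    "opus-4.8": (5.0, 25.0),
    "fable-5": (10.0, 50.0),
}


def run_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Dollar cost of one run at the quoted per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 1M input tokens and 200k output tokens on Opus 4.8.
INPUT_TOKENS = 1_000_000
OPUS_OUTPUT_TOKENS = 200_000
# Assumption: Fable 5 emits the same output volume as Opus 4.8, and Grok 4.5
# emits 4.2x fewer output tokens for comparable work.
GROK_OUTPUT_TOKENS = OPUS_OUTPUT_TOKENS / 4.2

for model, out_tokens in [
    ("fable-5", OPUS_OUTPUT_TOKENS),
    ("opus-4.8", OPUS_OUTPUT_TOKENS),
    ("grok-4.5", GROK_OUTPUT_TOKENS),
]:
    print(f"{model}: ${run_cost(model, INPUT_TOKENS, out_tokens):.2f}")
```

Under these made-up token counts the run costs about $20.00 on Fable 5, $10.00 on Opus 4.8, and $2.29 on Grok 4.5. The output-token efficiency widens the gap well beyond the 60% difference in input price.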

Which finally gives the pattern from the top of this piece a name: the asymmetric bookend. Expensive model architects the plan and writes the test cases. Cheap model, or several, run the volume work inside that plan, iterating until they converge or hit a wall. Expensive model judges the output against the original spec — not the same model that wrote the code, so it isn't grading its own homework. Tier by position in the loop, not by task difficulty — a distinct move from the fixability-score tiering Sentry runs, which tiers by confidence before committing any compute at all.
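The bookend can be sketched as a function that takes the three roles as plain callables. Everything here is hypothetical scaffolding: `run_bookend` and its parameter names are invented for illustration, and `architect`, `worker`, `run_tests`, and `judge` stand in for whatever model calls and test runner you wire up.

```python
from typing import Callable


def run_bookend(
    spec: str,
    architect: Callable[[str], str],              # expensive model: writes the plan
    worker: Callable[[str, str], str],            # cheap model: (plan, feedback) -> draft
    run_tests: Callable[[str], tuple[bool, str]], # cheap check: (passed, feedback)
    judge: Callable[[str, str], bool],            # expensive model: (spec, draft) -> verdict
    max_iterations: int = 5,
) -> dict:
    """Plan once with the expensive tier, iterate cheaply, judge once at the end."""
    calls = {"expensive": 0, "cheap": 0}
    plan = architect(spec)
    calls["expensive"] += 1

    draft, feedback, converged = "", "", False
    for _ in range(max_iterations):
        draft = worker(plan, feedback)
        calls["cheap"] += 1
        converged, feedback = run_tests(draft)
        if converged:
            break
    if not converged:
        # The worker hit a wall, so the judge tier is never spent on it.
        return {"approved": False, "reason": "worker hit its iteration budget", "calls": calls}

    approved = judge(spec, draft)  # judged against the original spec, not the plan
    calls["expensive"] += 1
    return {"approved": approved, "reason": "judged against the original spec",
            "calls": calls, "output": draft}
```

However many times the cheap tier iterates, the expensive tier is called at most twice, which is the asymmetry the name refers to.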

It's also the fix for my own opening scene. Thirty subagents on Opus, no plan holding them to a spec, everyone running the expensive model for undifferentiated work, a day's quota gone for a one-line import. A bookend catches that at the judge tier before it burns the account — or never spawns the swarm at all, because the plan tier would have scoped it as a one-line fix in the first place.

One caveat, because the ratchet only works if you're honest about where it doesn't hold: cheap-model loops need tighter constraints than expensive ones, not looser. More iterations means more chances to drift, and because each iteration costs less, it's tempting to under-invest in guardrails exactly where they matter most. The bookend isn't a free lunch. It's a different place to spend the same vigilance.