When I need something built now, I don't start by prompting a coding agent. I start by writing a spec.
Not because I stopped using them. The opposite. I use them more than ever — ontology research at Product.ai that took a full day per category now fans out across five providers overnight and clears by morning. But the way I use them changed — slowly, then suddenly enough that when Peter Steinberger posted two sentences last week and the internet lost its mind, my first reaction was: yeah, obviously.
"...you shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."
Boris Cherny, head of Claude Code at Anthropic, said the same thing on stage a few days earlier: "I don't prompt Claude anymore. I have loops running. They're the ones prompting Claude and figuring out what to do. My job is to write loops."
Millions of views. Addy Osmani formalized it. A dozen Substacks published guides. The SEO farms produced their "What Is Loop Engineering?" posts before Friday. Reddit, doing what Reddit does, replied: "It's just a while loop."
They're not wrong. But they're not as right as they think they are.
#The Shift I Didn't Name
Here's what my workflow actually looks like now. I spot a problem — a failing test, a performance regression, an architectural pattern that's drifting. I don't open a terminal and start prompting the coding agent. I write a spec: what the goal is, what the constraints are, what "done" looks like in testable terms. Sometimes I write the spec myself. More often I think it through with an AI as a thought partner — I'm still prompting, but at the planning layer, not the execution layer. Then I hand that spec to an orchestrator who diagnoses the approach, picks the right tools, and dispatches the work to a coding agent running in a sandboxed environment. The coding agent writes code, runs tests, reads the errors, fixes, runs again. When it finishes, the result comes back up the chain. I review, approve or redirect, and move on.
At no point did I type "please fix the auth middleware." The prompting didn't disappear — it moved up the stack. From "write this code" to "help me think about what this should be" to a structured spec that the system executes inside constraints. My job was designing those constraints and making the judgment calls that shaped the spec.
Same pattern for our knowledge pipeline — I wrote validation specs and quality thresholds instead of research prompts, the forge runner dispatched across providers overnight, and I reviewed clean output over coffee. Different domain, same shift.
I'd been doing this for months before anyone called it loop engineering. And honestly, the name sounds like more overhyped AI terminology — it is just a while loop. But naming it sharpened how I thought about what I was already doing. The label made the structure visible.
The three-layer model I proposed earlier this year maps directly onto why this works:
CONSTRAINTS → bounds the mutation space
PROMPTS → expresses intent, inherently fuzzy
CODE → source of truth, deterministic, testable

A loop is these three layers cycling autonomously. The constraint layer (rules, permissions, stop conditions, protected paths) bounds what the agent can do. The prompt layer (the goal definition, the task spec) tells it what to try. The code layer (tests, linters, build checks) decides whether it actually worked. Constraints bound variance. Prompts express intent. Code owns truth. The loop repeats until truth is satisfied or the constraints say stop.
#What Everyone's Describing
The anatomy of a single well-designed loop has five parts, and none of them are surprising. This is the convergence cycle — one goal, one or more agents, iterating until done:
A clear goal. "Make all tests in test/auth/ pass" works. "Make the app better" produces an infinite loop or meaningless output. The goal has to be evaluable by something other than vibes. In practice, that means the spec I write before dispatching — "done" defined before the agent starts.
But don't let the fix-and-repair examples mislead you about what goals can be. The most powerful loops are generative. The goal isn't "fix this bug." It's "implement this feature to satisfy this job-to-be-done" or "build the onboarding flow from this spec and verify it against these acceptance criteria." A roadmap item with clear exit conditions is a loop goal. A user story with testable outcomes is a loop goal. The pattern works anywhere the definition of done can be made concrete enough for a machine to check — and that's a much larger surface than just passing tests.
Tools. File system access, code execution, test runners, shell commands. The overnight ontology research fans out across five providers, each with browser and search access — the tool set determines what the loop can observe. A loop without tools is a model guessing in a circle.
Context management. Every iteration generates more history. Code written, errors hit, approaches tried. Without pruning or summarization, you hit token limits or the model loses track of what it already attempted. I wrote about this in The Context Engineering Stack — the retrieval quality at each step determines the loop's ceiling. (And forgetting well matters as much as retrieving well — Funes the Memorious had perfect memory and couldn't think.)
Termination logic. When to stop. Tests pass. Max iterations reached. Repeated errors with no progress. Hand off to a human. I watched one loop run forty iterations on a single type error last month — burned more tokens than my entire team used on Tuesday. If a human engineer spent four hours on one error without escalating, you'd call it a performance issue. Loops need the same instinct: iteration budgets, cost ceilings, escalation triggers that fire before the loop spends more than the fix is worth.
Error recovery. Recoverable errors (bad syntax, missing import) versus hard blockers (missing credentials, undefined behavior). The forty-iteration type error above was an error-recovery failure as much as a termination one — same approach on repeat, no variation. A loop that retries the exact same action after the exact same error isn't iterating; it's stuck.
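The termination and error-recovery parts above are easier to see in code. What follows is a minimal sketch, not how any particular product implements it: `run_loop`, `attempt`, `verify`, and the default budgets are all names and numbers I made up for illustration. The agent turn and the verifier are passed in as plain callables so the stop conditions stay visible.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    status: str       # "converged", "escalated", or "budget_exhausted"
    iterations: int
    cost: float
    reason: str


def run_loop(
    attempt: Callable[[Optional[str]], float],
    verify: Callable[[], Optional[str]],
    max_iterations: int = 10,
    cost_ceiling: float = 5.0,
    max_repeats: int = 3,
) -> LoopResult:
    """Iterate until verify() passes or a stop condition fires.

    attempt(last_error) runs one agent turn and returns what it cost.
    verify() returns None on success, or an error string on failure.
    All thresholds are illustrative defaults, not recommendations.
    """
    cost = 0.0
    last_error: Optional[str] = None
    repeats = 0
    for i in range(1, max_iterations + 1):
        cost += attempt(last_error)
        error = verify()
        if error is None:
            return LoopResult("converged", i, cost, "verification passed")
        # The same error coming back means the loop is stuck, not iterating.
        repeats = repeats + 1 if error == last_error else 1
        last_error = error
        if repeats >= max_repeats:
            return LoopResult("escalated", i, cost, f"no progress on: {error}")
        if cost >= cost_ceiling:
            return LoopResult("escalated", i, cost, "cost ceiling reached")
    return LoopResult("budget_exhausted", max_iterations, cost, "iteration budget spent")


# A stuck agent: every attempt costs 0.1 and the same type error comes back.
stuck = run_loop(attempt=lambda err: 0.1, verify=lambda: "TypeError: x is not a str")
print(stuck.status, stuck.iterations)  # escalated 3
```

With a repeat check like this, the forty-iteration type error would have escalated on the third identical failure instead of running out the budget.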
This anatomy is the ReAct pattern (Reason + Act) from the Princeton/Google research. Individual primitives from it have started shipping in products (convergence checking, scheduled runs, parallel isolation, subagent delegation), but the tooling implements pieces, not the composed pattern. The concept predates LLMs by decades. The tooling is catching up, and that's less exciting than "loops replace prompts" as a headline.
#What It's Not
It's not full autonomy. A loop without budgets and escalation paths is an expensive accident. Token costs scale with iteration count, and the choice between "token-rich" loops (full context every turn) and "token-poor" loops (compressed summaries) is already an active design decision with real cost tradeoffs. You're still designing the track.
It's not a replacement for prompt engineering. Prompts are still inside the loop — they become subagent definitions, skill specs, task descriptions. But the leverage moved. The prompt is one component; the context you engineer around it is what determines whether the loop converges or spins.
It's not new computer science. Reddit's right — structurally, it is a while loop. The tooling is what changed — it makes the pattern practical without custom bash scripts and duct tape.
And it's not free. A loop without economic circuit breakers isn't converging — it's an infinite loop with a credit card attached.
And it's not a long agent session. An agent running for four hours with dynamic workflows and parallel subagents is impressive, but duration and complexity don't make it a loop. Four properties do. A loop recurs: each cycle's output feeds the next cycle's input. What drives the recurrence varies (a cron schedule, a CI trigger, a webhook, or the human operator opening a new session tomorrow morning), and the mechanism matters less than the fact that the next cycle exists and inherits from the last one. A loop verifies automatically: something other than the agent or a human eyeballing it checks whether the cycle converged. A loop persists state: what happened last time is available next time. And a loop terminates or escalates on its own: it knows when to stop, when to ask for help, when the goal is met. Without recurrence and automated feedback closing the signal, you have a powerful session, not a loop. The compounding value (the ratchet, the constraint accumulation) comes from iteration across cycles, not computation within one.
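To make "recurs, persists state, terminates on its own" concrete, here is a small sketch under stated assumptions: the state lives in a JSON file, `run_cycle` and `fix_one` are hypothetical names, and the "work" is a stub that clears one item per cycle. Whatever triggers the next cycle (cron, CI, a human) only has to call `run_cycle` again.

```python
import json
import tempfile
from pathlib import Path


def run_cycle(state_path: Path, step) -> dict:
    """Run one cycle: load the last cycle's state, do the work, persist the result."""
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"cycle": 0, "remaining": None, "done": False}
    if state["done"]:
        return state  # the loop already terminated; nothing left to do
    # This cycle's input is the previous cycle's output.
    remaining = step(state["remaining"])
    state = {
        "cycle": state["cycle"] + 1,
        "remaining": remaining,
        "done": len(remaining) == 0,  # automated check, not a human eyeballing it
    }
    state_path.write_text(json.dumps(state))
    return state


def fix_one(remaining):
    """Stub for the agent's work: clear the first open item."""
    items = ["lint", "types", "tests"] if remaining is None else remaining
    return items[1:]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "loop_state.json"
    states = [run_cycle(path, fix_one) for _ in range(4)]
    print([s["cycle"] for s in states], states[-1]["done"])  # [1, 2, 3, 3] True
```

The fourth call is a no-op: the persisted state says the goal is met, so the cycle count stays at 3.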
#Loops Nest
Everyone's talking about the tight loop. Agent writes code, runs tests, reads error, fixes, repeats. Minutes per iteration. Fast feedback, narrow scope, clear exit condition. And that's useful — it's what makes Claude Code's /goal "all tests pass" work.
But that's one frequency. Loops nest. And the nesting is the architecture.
Tight loops run in minutes or hours. Write code → run tests → read error → fix → run again. The function is convergence — iterate toward a defined goal until it's met. The feedback signal is binary — pass or fail. Test runner, type checker, linter, build success. Seconds to minutes. The cadence is continuous: keep going until verification succeeds or the iteration budget runs out. Human judgment is encoded in the test suite, not exercised per iteration.
Medium loops run over hours or days. Scan the repo for open issues → triage by priority → assign to agents → review the PRs that come back → merge or redirect → update the backlog. The function is adaptive planning — not just burning through a fixed backlog, but re-evaluating the plan against what was actually built. A PR might pass every test and still have bad architecture. An issue might be technically solvable and strategically wrong to solve right now. What separates this from a sprint board is the feedback: each cycle's output reshapes the next cycle's input. The human is at the gate — not in every turn, but at every point that requires taste.
The medium loop's feedback signal isn't just PR review — it's production sensing. Error tracking (Sentry, Datadog) surfacing regression spikes after deploys. APM data showing latency percentiles creeping. Log aggregation catching recurring warning patterns. Canary deploys and feature flags giving you a structured before/after signal on real traffic. Beta users and dogfooding catching the things test suites can't — the flow that's technically correct but feels wrong, the edge case that only appears with real data. This data already exists in most production systems; the underexplored move is feeding it into the loop as a structured signal. A medium loop that ingests your error tracking feed, correlates new exceptions with recent PRs, and auto-triages fix priority is the natural next step — and almost nobody's doing it yet.
Long loops run over weeks or months. What patterns keep failing? Which architectural decisions are calcifying? Are the constraints still right or have they become cargo cult? What should the team build next quarter? The function is orientation — sensing, triaging, evaluating against lagging indicators and strategic goals. The feedback signal here is slower and noisier: user metrics (retention, conversion, churn), support ticket patterns, developer velocity trends, NPS, customer research. No test runner evaluates whether your product direction is correct. A retention metric tells you in six weeks if the decision was right — and even then you're guessing at causality. The cadence is mostly human-initiated: quarterly reviews, post-mortems, the moment when metrics start telling a story you didn't expect.
The relationship between loop frequency and human involvement isn't incidental. It's structural:

Long loop (weeks/months): "Orient: what should we build and why?"
└─ Medium loop (hours/days): "Replan: what's next, given what we just learned?"
   └─ Tight loop (minutes/hours): "Converge: write, test, fix, repeat"

The tighter the loop, the more the feedback is binary and automatable. The longer the loop, the more the feedback is lagging, noisy, and dependent on causal interpretation: judgment, context, experience, taste. The things models can't do yet and might not do soon. Human judgment doesn't leave the system when you adopt loop engineering. It moves up the stack. You stop being the person who reads each error message and start being the person who decides what "done" means, what's worth building, when the constraints need updating, and when the loops should escalate back to you.
I think this is the under-explored idea. The fractal — tight loops inside medium loops inside long loops, each with different feedback signals and different levels of human involvement — is the actual architecture. The inner loops automate. The outer loops are where strategy and taste live. And the flow between them — constraints propagating down, signals bubbling up — is the design problem I'm circling. The down-flow works: constraints propagate through specs, hooks, rules files. The up-flow is where the real signal gap lives. Production observability — error rates, latency trends, deploy health — is structured and machine-readable, but mostly unused as loop input. Business metrics — retention, revenue, NPS — exist but are noisy and slow. The medium loop could ingest Sentry and APM data today; almost nobody wires it up. The long loop's signals are still mostly sensed by humans: I notice a pattern, I change a plan. No mechanism yet for the strategic layer, just attention. But the production layer is low-hanging fruit — the data is already there, already structured, already flowing. The loop just isn't listening to it.
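Here is roughly what "listening" could look like for the production layer. This is a sketch under assumptions, not an integration: `Deploy` and `ErrorEvent` are hypothetical record shapes (not the Sentry or Datadog APIs), the six-hour window is arbitrary, and the correlation is the crudest one that works, namely errors in files a PR touched, shortly after it deployed.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deploy:
    pr: int
    at: datetime
    files: frozenset


@dataclass(frozen=True)
class ErrorEvent:
    fingerprint: str
    at: datetime
    file: str


def triage(deploys, events, window=timedelta(hours=6)):
    """Rank PRs by errors seen in files they touched within `window` of deploy."""
    suspects = Counter()
    for ev in events:
        for d in deploys:
            if d.at <= ev.at <= d.at + window and ev.file in d.files:
                suspects[d.pr] += 1
    return suspects.most_common()


# Made-up sample data for illustration only.
t0 = datetime(2025, 1, 6, 9, 0)
deploys = [
    Deploy(101, t0, frozenset({"auth/middleware.py"})),
    Deploy(102, t0 + timedelta(hours=1), frozenset({"billing/invoice.py"})),
]
events = [
    ErrorEvent("KeyError:session", t0 + timedelta(minutes=20), "auth/middleware.py"),
    ErrorEvent("KeyError:session", t0 + timedelta(minutes=45), "auth/middleware.py"),
    ErrorEvent("ZeroDivisionError", t0 + timedelta(hours=2), "billing/invoice.py"),
    ErrorEvent("Timeout", t0 + timedelta(days=2), "auth/middleware.py"),  # outside window
]
print(triage(deploys, events))  # [(101, 2), (102, 1)]
```

The ranked list is the structured signal: a medium loop can turn the top entries into tasks, and a human at the gate decides which ones are worth a fix.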
But "the long loop is human territory" is too simple. The long loop isn't purely manual — it's a collaboration. The more clearly you've articulated your strategy, product vision, engineering philosophy, and business context, the more the system can do with it. A loop that has access to your product roadmap, your architectural axioms, your definition of what matters — that loop can flag things that are underscoped. A roadmap item without a defined plan. A feature without acceptance criteria. A goal that contradicts an existing constraint.
That flagging is the real human-in-the-loop moment: the system saying "this item doesn't have enough definition for me to act on — here's what's missing." You go in, pressure-test the plan, refine the scope, make the judgment calls. Then it flows back down: the scoped plan becomes a medium-loop task, which spawns tight loops, which execute and verify. The system attempts to constrain itself and answer its own questions based on the context you've built — your product, your business, your engineering philosophy, all encoded as constraints.
The implication: the richness of your constraint layer determines how far down the stack the system can operate without escalating. Thin constraints mean constant escalation — the agent keeps asking you questions. Rich constraints — a well-articulated strategy, clear architectural principles, codified product axioms — mean the agent can self-resolve more decisions and only escalate the ambiguous ones. The human's job in the long loop isn't just "make the hard calls." It's building the context that makes fewer calls hard.
At Product.ai we've started treating the constraint layer as its own product. The specs we write, which define what must be true before any agent starts building, are the primary artifact.
That inversion changed how the whole team ships: strategy becomes systems without a human typing in the middle, because the constraints already encode what "good" means.
#The Primitives
The five parts above describe the anatomy of a single loop — one agent working toward one goal. But when loops nest, when the medium loop spawns tight loops and the long loop reshapes both, the parts list expands. A system of loops needs primitives the single loop doesn't. I found the full set the way you always do — by running loops that broke and asking what was missing. The five anatomy parts redistribute: tools stay tools, termination splits into cadence and verification, error recovery becomes escalation, goal and context management evolve into task and memory. The three new primitives — artifact, isolation, constraint — exist because systems of loops have problems single loops don't.
#Cadence
When and why the loop runs again. This is the primitive that makes a loop a loop rather than a one-shot.
A tight loop's cadence is continuous — keep iterating until verification passes. A medium loop's cadence is scheduled or event-driven — daily triage, CI triggers, the cron job that scans for new work every morning. A long loop's cadence is human-initiated — quarterly reviews, the post-mortem after a bad deploy, the moment metrics tell a story you didn't expect.
The skeptic's objection — "cronjobs have funny rebranding" — is half right: the scheduling layer is cron. What's new is the decision logic in the middle. A cron job runs a fixed script. A loop-on-cron runs a model that observes current state, decides what to do, acts, and checks whether it worked. The timer is old; the judgment per tick is not.
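The difference between a cron job and a loop-on-cron fits in a few lines. In this sketch the scheduler is left out entirely (it really is just cron), and `loop_tick`, `observe`, `decide`, `act`, and `check` are hypothetical names. `decide` is where the model call would sit; here it is a stub rule so the example runs on its own.

```python
from typing import Callable, Optional


def loop_tick(
    observe: Callable[[], list],
    decide: Callable[[list], Optional[str]],
    act: Callable[[str], None],
    check: Callable[[str], bool],
) -> str:
    """One tick of a loop-on-cron: observe, decide, act, then verify."""
    state = observe()
    action = decide(state)   # in a real loop this is the model call
    if action is None:
        return "idle"        # nothing worth doing this tick
    act(action)
    return "ok" if check(action) else "escalate"


# Stub environment: a list of failing checks stands in for repo state.
failing = ["test_login", "test_logout"]


def observe():
    return list(failing)


def decide(state):
    return state[0] if state else None


def act(name):
    failing.remove(name)


def check(name):
    return name not in failing


print([loop_tick(observe, decide, act, check) for _ in range(3)])
# ['ok', 'ok', 'idle']
```

A fixed script would have run the same command three times. The tick function does different things on each run because it looks at current state first, and it reports when an action did not take.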
The cadence primitives shipping today reflect this split. Claude Code's /loop is session-scoped polling — watch-style, re-fires a prompt on an interval while you're in the conversation. For scheduling that runs independently — with the laptop closed, across sessions — there are Desktop scheduled tasks, Cloud Routines, GitHub Actions, plain cron: anything that can trigger an agent run on a cadence without a human in the chair. Codex Automations ship the scheduled discovery pattern: runs that find work, triage it, and land findings in an inbox for human review.
#Artifact
The output that makes the loop's state legible. An HTML report with color-coded severity. An SVG diagram. A rendered diff with inline annotations. Thariq Shihipar at Anthropic made the case that HTML has replaced Markdown as the right output format for agent work — because when you're reading agent output, not editing it, you want spatial layout, embedded diagrams, interactive controls. The artifact is how the loop communicates with the human at the gate. Its quality determines how well that human can do their job. A 200-line Markdown dump is where they lose the thread.
#Task
The unit of work. A markdown spec, a GitHub issue, a Linear ticket, a YAML file. The thing that carries goal, context, and exit condition in one package. Breakable into subtasks. The medium loop lives here — task discovery, triage, assignment, completion. The format is less important than the structure: what's the goal, what are the constraints, what does done look like.
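One way to hold that structure, sketched as a dataclass. The field names (`goal`, `constraints`, `done_when`, `subtasks`) and the `gaps` check are my own illustration, not a standard format; the same shape fits a YAML file or a ticket template. The check also covers the underscoped-item case: a task with no exit condition is flagged before anything gets dispatched.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    goal: str
    constraints: list[str] = field(default_factory=list)
    done_when: list[str] = field(default_factory=list)   # testable exit conditions
    subtasks: list["Task"] = field(default_factory=list)

    def gaps(self, path: str = "") -> list[str]:
        """List what is missing before this task is safe to dispatch."""
        name = f"{path}/{self.goal}" if path else self.goal
        missing = []
        if not self.goal.strip():
            missing.append(f"{name}: no goal")
        # A leaf task needs its own exit condition; a parent can rely on its children.
        if not self.done_when and not self.subtasks:
            missing.append(f"{name}: no exit condition")
        for sub in self.subtasks:
            missing.extend(sub.gaps(name))
        return missing


feature = Task(
    goal="ship onboarding flow",
    constraints=["no writes to migrations/"],
    subtasks=[
        Task(goal="signup form", done_when=["tests in test/onboarding/ pass"]),
        Task(goal="welcome email"),
    ],
)
print(feature.gaps())  # ['ship onboarding flow/welcome email: no exit condition']
```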
#Verification
The feedback signal. Tests (TDD), type checks, linters, build success, visual snapshots, even "does the HTML look right when I open it." Ranges from binary (tests pass) to fuzzy (PR review, architectural judgment). Cadence is what makes the loop recur; verification is what makes it converge — without it, the agent has no way to know if it succeeded, and the loop spins instead of tightening.
Claude Code's /goal ships this as a tight-loop convergence primitive: you set a condition, a separate evaluator model checks after each turn whether it holds, and the agent keeps working until it's met. It runs autonomously within a single cycle: type "/goal all tests pass and lint is clean" and walk away. The agent that wrote the code isn't grading its own homework.
#Isolation
Worktrees, sandboxes, branches, separate agent sessions. The thing that lets multiple loops run in parallel without collision. Two agents editing the same file is two engineers committing to the same branch with no PR process. Isolation is the primitive that makes parallelism safe. Both Claude Code and Codex ship this as built-in worktrees — a fresh branch per agent session, merged when the work converges.
But isolation solves the easy problem — preventing writes from colliding. The harder problem is coordination: what happens when two sibling loops make decisions that are individually correct but collectively incompatible? One agent refactors an interface. Another agent builds a feature that depends on the old shape. Both loops converge, both pass their tests, and the merge fails. Isolation without coordination is parallel work that creates serial cleanup. The honest answer is that this is still mostly unsolved. The workarounds are blunt: serialize work that touches shared interfaces, or accept that the medium loop's job includes resolving conflicts between its children. A merge failure from parallel agents isn't a bug in either agent — it's a missing coordination primitive that the loop system doesn't have yet.
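The blunt workaround, serializing work that touches shared interfaces, can at least be automated. The sketch below assumes each task declares the paths it will touch (a big assumption: an agent that only reads the old interface shape has to declare that file too, or the overlap goes unseen). `plan_batches` and the task names are hypothetical. Tasks in the same batch can run in parallel; batches run one after another.

```python
def plan_batches(tasks: dict[str, set[str]]) -> list[list[str]]:
    """Greedily group tasks so no two tasks in a batch declare the same path."""
    batches: list[tuple[list[str], set[str]]] = []
    for name, paths in tasks.items():
        for names, claimed in batches:
            if claimed.isdisjoint(paths):
                names.append(name)
                claimed |= paths   # in-place update of this batch's claimed paths
                break
        else:
            # Every existing batch overlaps with this task: start a new batch.
            batches.append(([name], set(paths)))
    return [names for names, _ in batches]


tasks = {
    "refactor-session-interface": {"auth/session.py", "auth/types.py"},
    "add-sso-login": {"auth/sso.py", "auth/types.py"},
    "fix-invoice-rounding": {"billing/invoice.py"},
}
print(plan_batches(tasks))
# [['refactor-session-interface', 'fix-invoice-rounding'], ['add-sso-login']]
```

This is scheduling, not coordination. It avoids the collision by giving up parallelism on the shared file, which is exactly the trade the workaround describes.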
#Memory
AGENTS.md, state files, markdown logs, progress trackers, the repo itself. What persists between iterations and across sessions. Osmani's line: "The agent forgets, the repo doesn't." The ratchet lives here — the accumulated constraints that encode every past failure.
#Tools
Filesystem, shell, browser, MCP servers, APIs. The interfaces between the agent and the real environment. Claude in a browser is a tool. grep is a tool. A test runner is a tool. The tool set bounds what the loop can do — an agent without code execution can't close a feedback loop on code.
#Constraint
Rules files, hooks, permission boundaries, protected paths. What bounds what the loop can do wrong. This is the physics layer from the three-layer model — the constraint engineering that makes everything else safe to automate.
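A protected-path check is about the smallest constraint you can enforce structurally rather than through prose. This is a generic sketch, not the hook format of any specific tool: the `PROTECTED` patterns and `blocked_writes` are illustrative, and how the check gets wired in (pre-commit hook, CI step, agent hook) is left open.

```python
from fnmatch import fnmatchcase

# Illustrative patterns. Note that in fnmatch, "*" also matches "/" characters,
# so "migrations/*" covers nested directories as well.
PROTECTED = ["migrations/*", "*.lock", ".github/workflows/*"]


def blocked_writes(changed_paths, protected=PROTECTED):
    """Return the changed paths that match a protected pattern."""
    return [
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in protected)
    ]


changed = ["src/auth/middleware.py", "migrations/0007_add_index.py", "poetry.lock"]
print(blocked_writes(changed))  # ['migrations/0007_add_index.py', 'poetry.lock']
```

A non-empty result is the signal to reject the change or escalate, depending on how strict the loop should be.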
#Escalation
The handoff mechanism. PR reviews, triage inboxes, Slack notifications, the ping that says "I'm stuck, please look." How the inner loop hands control to the outer loop. Without escalation, a stuck agent burns tokens indefinitely instead of asking for help.
#Observability
Logs, traces, cost tracking, the ability to diagnose failures in the loop system itself rather than the code it produces. When a tight loop burns forty iterations on a type error, the failure isn't in the agent's code — it's in the loop's termination logic. When two parallel loops produce incompatible merges, you need to see the decision trace, not just the merge conflict. The artifact primitive helps here — the loop's own execution is a thing worth rendering legibly — but most setups don't treat loop observability as a first-class concern yet. They should. A system you can't debug is a system you can't trust past toy scale.
These compose differently at each frequency. A tight loop needs cadence + task + tools + verification + isolation. A medium loop adds artifact + memory + escalation, and the cadence shifts from continuous to scheduled. The long loop is mostly constraint refinement + task generation + human judgment — the cadence is the longest, the primitives it relies on are the least automatable, and the replanning between cycles is where strategy actually lives.
Individual primitives are shipping. No product ships the composed system — the recurrence across sessions, the replanning, the upward signal flow. That's still yours to build.
The other variable is the model underneath. Mythos-class models — Fable 5 and equivalents — show a performance gap over previous generations that grows with task duration: marginal on short work, significant on multi-hour agentic sessions. That's exactly the regime where medium loops live. Tight loops were already reliable. The question is whether the extended coherence makes medium loops practical or just makes one-shots longer.
#Four Layers, One Stack
Four layers keep coming up in every conversation about agent tooling. Yes, four kinds of "engineering" is a lot of engineering. But they're layers in a single stack, not competing disciplines, and each one owns a different question:
Prompt engineering — how do I write a clear instruction? The oldest layer. Role definitions, output formats, examples, constraints. Still necessary. Now one component among several.
Context engineering — what information does the agent need? I wrote about this at length — the constraint layer (CLAUDE.md, axioms, manifests), semantic search (embeddings, reranking), agentic search (multi-turn reasoning over code). The layer that determines what the model sees. The differentiator was never the model. It was what the model could see.
Loop engineering — how does the agent keep working until done, and when does it start again? Goal definition, cadence, verification, error recovery, termination. The operational layer that makes a single prompt-and-context pair recurring, self-correcting, and — at the medium and long frequencies — self-directing.
Harness engineering — how does the agent run safely? Viv Trivedy coined the term. His equation: Agent = Model + Harness. Tool permissions, sandbox isolation, execution logs, monitoring, human takeover. And the ratchet — every failure becomes a permanent constraint. The agent deletes a test instead of fixing it? Add a rule. The agent modifies a migration file? Add a hook that blocks it. The system gets smarter over time because you encode every failure into the constraint layer.
Addy Osmani draws the relationship: the loop sits one floor above the environment. Same components, but now they run on a schedule, spawn helpers, and feed themselves.
Prompts go inside context. Context goes inside loops. Loops run inside safe environments. Each layer answers a different question. Getting one right doesn't exempt you from the others.
Most teams doing impressive work with coding agents right now, including mine, are closer to harness engineering with long autonomous sessions than to loop engineering. What we do is split planning from implementation into two discrete steps: a human-driven planning phase — grounding, spec, architectural decisions — followed by a long-running agent session that executes within the constraints the planning step defined. The implementation step is harness engineering: dynamic workflows, subagents, hours of autonomous work. The loop cycle — plan, execute, review, replan — exists, but the cadence is mostly human-led. I decide what to build next. I kick off the session. I review the output. I update the constraints.
The exceptions prove the pattern. Error triage, signal aggregation, dependency scanning — tasks with machine-verifiable exit conditions and structured input — those run autonomously on a schedule. Those are actual loops. Everything else is a human driving the cycle with a well-built harness underneath. The gap between those two modes is where the interesting engineering problem lives.
#The Ratchet Is the Constraint Layer Tightening
The pattern that connects all of this to what I've been arguing:
The ratchet — Osmani's term for "every agent mistake becomes a permanent rule" — is the constraint layer getting tighter over time. The mutation space shrinks. The agent has less room to do the wrong thing. Quality goes up not because the model got smarter but because the physics got stricter.
This is what "constraints bound variance" looks like in practice. I said months ago — in Code Owns Truth, in Design Physics, and again in the harness engineering field report — that the designer's job is Layer 1: engineering the constraints. The loop engineering discourse is arriving at the same conclusion from the operational side: the constraints determine the loop's ceiling. A well-tuned model with loose constraints still produces expensive chaos. Tighten the constraints and even a decent model gets reliable.
The interesting implication: the ratchet means the constraint layer is self-improving. Every loop iteration that fails teaches you something, and if you encode that something, the next iteration can't fail the same way. The system accumulates scar tissue. Over enough cycles, the constraints become a compressed record of every mistake the system ever made — which is another way of saying they become judgment, crystallized.
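Mechanically, the ratchet can be as simple as an append-only rules file with a duplicate check. The sketch below assumes a plain text file with one rule per line; `encode_failure` and the `RULES.md` file name are hypothetical, and a real setup would more likely write to whatever constraint file the agent already reads.

```python
import tempfile
from pathlib import Path


def encode_failure(rules_file: Path, rule: str) -> bool:
    """Add a rule to the constraint file unless it is already there.

    Returns True when the file changed.
    """
    lines = rules_file.read_text().splitlines() if rules_file.exists() else []
    entry = f"- {rule.strip()}"
    if entry in lines:
        return False
    lines.append(entry)
    rules_file.write_text("\n".join(lines) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    rules = Path(tmp) / "RULES.md"  # hypothetical file name
    print(encode_failure(rules, "Never comment out tests: delete them or fix them."))  # True
    print(encode_failure(rules, "Block writes to migrations/."))                        # True
    print(encode_failure(rules, "Never comment out tests: delete them or fix them."))  # False
    print(len(rules.read_text().splitlines()))                                          # 2
```

The exact-match check only catches verbatim repeats. Near-duplicates still pile up, so the file needs a periodic review.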
But a ratchet that only tightens eventually seizes. Reactive rules accumulate fast — "don't comment out tests," "block writes to migrations," "never delete a fixture file" — and left unchecked, they become a contradictory rulebook that paralyzes the agent or makes the constraint file itself a maintenance burden. The move is consolidation: periodically review the reactive rules and distill them into proactive constraints. Ten specific "don't do X" rules about test files might consolidate into one architectural principle about test ownership — I'm staring at eighteen enforcement hooks in our constraint layer right now, and the consolidation pass is overdue. Five rules about migration safety might become a single hook that enforces the policy structurally rather than through prose. The reactive rules are how you learn what matters. The proactive constraints are how you encode that learning durably. This is a long-loop function — orientation applied to the constraint layer itself. Are these rules still right? Have any become cargo cult? Can three of them collapse into one that's enforced by tooling instead of instructions? The ratchet tightens, but it also needs to be re-forged periodically into something leaner.
That's the version of this that actually matters. Not "loops replace prompts." Not "while loops with LLMs inside." The system that tightens its own constraints over time, where human judgment gets encoded into the physics rather than exercised per turn. Where taste gets encoded into the environment itself — and where the encoding stays clean enough to remain useful.
#Where to Start
If any of this resonates and you want to start thinking in loops:
Start with the environment, not the loop. The most common failure mode is automating a cycle before the foundation is ready. Get your constraint files solid. Document conventions, protected paths, test commands, known failure patterns. I spent two weeks doing nothing but writing axioms and constraint specs before letting an agent run unsupervised. That prep work is still paying returns months later.
Pick one task with a verifiable exit condition. Not "lint is clean" — something with real scope. Implement a feature where the exit condition is acceptance criteria met, tests pass, no regressions. Analyze a dataset and return a verdict with cited evidence. Design a component from a JTBD spec and output a working mockup. Review a contract and flag risks against standard terms. The pattern works anywhere the definition of done can be checked — by a test suite, a structured rubric, or a separate evaluator model grading the output.
The harness includes how you structure the approach: for a feature, one loop writes the tests from the spec (TDD), the next loops implement against them, checking for regressions between cycles. For research, one loop gathers sources, the next synthesizes, a third verifies citations. The decomposition into loops is itself a design decision — and one of the first places the ratchet teaches you something. Watch it converge. Learn where it gets stuck. That's your first ratchet input.
Encode every failure. This is the most important practice in the whole discipline. Every time the agent fails in a way you didn't anticipate, write it down as a rule. The agent commented out a test? "Never comment out tests — delete them or fix them." The agent modified a migration file? Block writes to migrations/. The constraint file gets longer. The failures get fewer. This compounds.
Graduate to replanning. Once tight loops converge reliably, zoom out. Can the system find its own work? Scan for open issues, failing CI, stale branches? Can it triage — decide what's worth doing versus what's blocked versus what needs a human? More importantly: after each cycle, does the plan still make sense given what was actually built? This is where the medium loop starts to form — not a fixed backlog, but an adaptive one.
Be honest about orientation. Strategy, product direction, "should we even build this" — that's not automatable today. Models can surface patterns, flag lagging indicators, summarize signals. But the judgment call is yours. The most mature practitioners I've seen are explicit about where their loops stop and human sensing begins.
You don't need a specific product for any of this. Aider, Pi, Cursor, even a bash script that runs tests, pipes failures to an LLM, applies the diff, and reruns — the primitives are universal. What matters is whether the loop has a feedback signal and the constraints to stay on track.
Design your constraints. Let the agents iterate inside them. Encode every failure. Move your judgment up the stack as the inner loops get reliable.
The loop was always there. Most of us were just running it manually, one prompt at a time, without noticing what we were doing. The shift isn't the concept. The shift is making it explicit — and then sitting with what the ratchet implies.
Every failure I encode shrinks the territory I just called mine. The constraints get smarter, the loops more reliable. The boundary between "the agent handles this" and "this requires me" moves inward, and it doesn't move back. I don't know if there's a floor — whether there's some irreducible core of judgment that can't be crystallized into a constraint file, or whether the ratchet just keeps tightening.
That's the question the next year of running these loops is going to answer for me, whether I like the answer or not.