Closing the Loop
Most people working with AI are running in circles and calling it progress.
You write a prompt. You get output. You squint at it. "That's not quite right." You tweak the prompt. You get different output. You squint again. "Better? Maybe?" You tweak again. An hour later, you've made twelve changes and you can't remember what the first version looked like, let alone whether version twelve is actually better than version six.
This is an open loop. Input goes in, output comes out, nobody measures anything. It feels productive because you're busy. But busy isn't better. Without measurement, you can't distinguish improvement from churn.
I've been building with AI since 2020. Writing, coding, generating media, shipping products. The single most important skill I've developed isn't prompting. It isn't picking the right model or knowing which parameters to tune. It's closing the loop. Making the work self-improving by systematically measuring whether changes actually help.
The open loop problem
Open loops are everywhere because they're the path of least resistance. You try something, eyeball the result, adjust, repeat. It's how most creative work has always operated. It's fine when you have strong intuition about quality and a slow iteration cycle.
AI broke both of those assumptions. The iteration cycle is now seconds, not hours. You can try fifty variations before lunch. Your intuition about what's "good" gets overwhelmed by volume. When you've seen thirty versions, you lose the thread. You start optimizing for novelty instead of quality. You start confusing "different" with "better."
Without a baseline, you can't learn across sessions. You make the same mistakes Monday that you made Friday. Every session starts from zero because nothing was recorded, nothing was measured, and the only feedback mechanism was your own tired eyes.
The autoresearch pattern
Andrej Karpathy published a project called autoresearch earlier this year. It's a framework for autonomous ML research. The core idea is almost comically simple.
A human writes goals in a program file. An AI agent modifies code to pursue those goals. The system runs for a fixed amount of time, evaluates the result against a single metric, and either keeps the change or discards it. Then it loops. You go to sleep. The machine runs experiments. You wake up to a log of everything it tried and a model that's measurably better than when you left it.
The human programs the direction. The machine programs the implementation. The key is the evaluation metric. Without it, the agent would just churn. With it, every experiment either moves the number or doesn't. Keep or discard. Binary. Clean.
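The keep-or-discard loop is simple enough to sketch. This is a toy stand-in, not autoresearch's actual code: evaluate and propose_change here are hypothetical placeholders for the real metric and the agent's edits, and the "experiment" is just a random perturbation of numeric parameters.

```python
# Toy sketch of the closed loop: try a change, measure, keep or discard.
import random

def evaluate(params):
    # Stand-in metric: lower is better (think validation loss).
    return sum((p - 3) ** 2 for p in params)

def propose_change(params):
    # Stand-in for the agent's edit: perturb one parameter.
    new = list(params)
    i = random.randrange(len(new))
    new[i] += random.uniform(-1, 1)
    return new

def closed_loop(params, budget=200):
    best = evaluate(params)
    for _ in range(budget):
        candidate = propose_change(params)
        score = evaluate(candidate)
        if score < best:              # moved the number: keep
            params, best = candidate, score
        # otherwise: discard, try again
    return params, best

random.seed(0)
params, score = closed_loop([0.0, 0.0])
```

The entire structure is the if-statement: every experiment either moves the number or it doesn't.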
Karpathy built this for ML research, where metrics are native to the work. Loss functions, accuracy scores, benchmark results. The numbers are right there. But the pattern isn't an ML technique. It's a discipline. It applies to everything.
What closing the loop actually looks like
A real example. I've been building a game. A narrative experience with an AI-driven narrator that responds to player choices in real time. Players talk, the narrator talks back, the story unfolds.
The problem with this kind of system is that it degrades in ways that are hard to see. The narrator repeats itself. It contradicts earlier story beats. It loses track of what the player has already done. Each individual response might seem fine. But over a thirty-minute playthrough, the cracks compound. A player won't tell you "the narrator contradicted itself at minute fourteen." They'll just tell you the game felt off.
So we closed the loop. Here's how.
First, we instrumented the game to capture complete playthrough logs. Every event, every narrator response, every player action. Timestamped and structured. Not for analytics. For diagnosis.
Then we built an automated quality audit. It reads a playthrough log and scores it across seven categories: repetition, state consistency, narrative coherence, pacing, engagement signals, error recovery, and instruction compliance. Each category produces a count of issues found, with specific citations from the log. No vibes. Numbers.
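A stripped-down version of such an audit might look like the sketch below. The category names follow the list above, but the detection rule shown (verbatim repetition) is an illustrative stand-in; the real audit's heuristics are not reproduced here, and the other six categories would each get their own detector.

```python
# Toy audit: count issues per category, with citations back to the log.
from collections import Counter

CATEGORIES = ["repetition", "state_consistency", "narrative_coherence",
              "pacing", "engagement", "error_recovery",
              "instruction_compliance"]

def audit(log):
    """Score a playthrough log: issue counts per category, plus citations."""
    issues = Counter()
    citations = []
    seen_lines = set()
    for i, event in enumerate(log):
        text = event["narrator"]
        if text in seen_lines:          # verbatim repeat of an earlier beat
            issues["repetition"] += 1
            citations.append((i, "repetition", text))
        seen_lines.add(text)
    return dict(issues), citations

log = [
    {"narrator": "You enter the cave."},
    {"narrator": "The torch flickers."},
    {"narrator": "You enter the cave."},   # repeated beat
]
counts, cites = audit(log)
```

The citations matter as much as the counts: a number tells you something is wrong, a citation tells you where to look.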
We ran the audit against forty-five playthroughs and established a baseline. Ninety-five issues total. That's the number. That's where we were. Not "it seems pretty good" or "players mostly like it." Ninety-five issues across seven categories, with a breakdown showing exactly where the problems clustered.
Then we started experimenting. One change at a time. Diagnose the root cause of the most common issue category, make a single targeted fix, run new playthroughs, audit them, compare to baseline. If the number goes down and nothing else gets worse, keep it. If it doesn't help, or if it fixes one thing but breaks another, discard it and try a different approach.
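The keep-or-discard decision can be made mechanical. This sketch assumes audit results come back as dicts of per-category issue counts (the numbers below are illustrative, not our actual breakdown):

```python
# Keep a change only if total issues drop AND no category regresses.
def should_keep(baseline, candidate):
    if sum(candidate.values()) >= sum(baseline.values()):
        return False
    return all(candidate.get(cat, 0) <= count
               for cat, count in baseline.items())

baseline = {"repetition": 40, "pacing": 30, "state_consistency": 25}
fix_a    = {"repetition": 12, "pacing": 30, "state_consistency": 25}
fix_b    = {"repetition": 10, "pacing": 45, "state_consistency": 25}
```

Here fix_a is kept, while fix_b is discarded even though its total improved: it fixed repetition but made pacing worse, exactly the failure mode the rule guards against.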
Five experiments in one session. Ninety-five issues down to roughly eleven. Not by working harder. Not by rewriting the whole system. By measuring.
Two of those five experiments didn't help. We made changes that seemed obviously right, ran the audit, and the numbers either stayed flat or got worse. Without measurement, we would have shipped those changes, confident they were improvements. Open-loop thinking: "I fixed the obvious problem, therefore it's better now." Closed-loop reality: "I fixed the obvious problem and it didn't move the number. Either the fix was wrong or the diagnosis was wrong."
Failed experiments aren't failures. They're data. The experiment log is arguably more valuable than the final result. Including the things that didn't work. It's a map of the problem space.
The three levels
Once you see the pattern, you notice it operates at three scales.
Project level. The most concrete. You have a thing you're building. You define quality metrics for that thing. You measure, change, measure again. The game audit is a project-level loop. So is A/B testing a landing page, benchmarking an API's response time, or scoring generated images against a rubric. The metric varies. The discipline doesn't.
Process level. You're not just measuring the output. You're measuring how you work. When I work with an AI collaborator across sessions, I track correction rate: how often does it make the same mistake twice? Is that rate going down over time? If it's not, my instructions aren't working. I'd never know without the metric. Most people who use AI regularly have a vague sense that "it's getting better" or "I'm getting better at prompting." Vague isn't useful. Measure the correction rate. Measure the rework rate. Measure how many sessions it takes to ship something. Then try to move those numbers.
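Correction rate is easy to track once you write corrections down. A hypothetical sketch, assuming each session's corrections are logged as short labels:

```python
# What fraction of each session's corrections repeat an earlier one?
def correction_rate(sessions):
    seen, rates = set(), []
    for corrections in sessions:
        repeats = sum(1 for c in corrections if c in seen)
        rates.append(repeats / len(corrections) if corrections else 0.0)
        seen.update(corrections)
    return rates

sessions = [
    ["wrong import style", "missed test"],
    ["wrong import style"],    # same mistake twice
    ["new edge case"],         # new mistake, fine
]
rates = correction_rate(sessions)
```

A new mistake per session is normal; the same mistake twice means your instructions aren't sticking, and a rate that trends down over time is the number you're trying to move.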
Meta level. Zoom out further. Is the way you work improving? Are your sessions getting more productive? Are you spending less time on rework and more time on new problems? Hardest to measure. Most valuable. It's the difference between ten years of experience and one year of experience repeated ten times.
How to close your loop
This isn't abstract theory. If you're doing any kind of regular work with AI, you can start closing loops today.
Define "good" before you start. Write it down. Be specific. "Better copy" isn't a metric. "Fewer than three passive voice constructions per page, reading level between 8th and 10th grade, every section opens with a concrete example." That's a metric. You might be wrong about what "good" means. That's fine. A wrong metric you can iterate on is better than no metric at all.
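Writing the metric down can literally mean code. A sketch of the copy example above, assuming you already have counters for passive constructions and reading grade level (those counters are outside this sketch):

```python
# "Good" as a checkable definition, not a feeling.
from dataclasses import dataclass

@dataclass
class CopyMetric:
    max_passive_per_page: int = 3     # "fewer than three"
    min_grade: float = 8.0
    max_grade: float = 10.0

    def passes(self, passive_count, grade_level):
        return (passive_count < self.max_passive_per_page
                and self.min_grade <= grade_level <= self.max_grade)

metric = CopyMetric()
```

If the definition turns out to be wrong, you edit the dataclass. That's iterating on the metric, which you can't do with a metric that lives in your head.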
Measure your baseline. Whatever your quality metric is. Bug count, word count, user satisfaction score, time to completion, correction rate. Measure it before you change anything. The baseline is what makes everything after it meaningful. Without it, you're just generating numbers.
One change at a time. The hardest discipline and the most important. When you see five problems, the instinct is to fix all five at once. Don't. If you change three things and the number improves, you don't know which change helped. Maybe two of them actually made things worse and the third more than compensated. You'll never know. Next time you'll carry forward the bad changes with the good.
Log everything. Every experiment, every result, every failed attempt. The log is the product. Not just the final version. The whole trail. Six months from now, when you hit a similar problem, the log tells you what you already tried and what the results were.
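The log itself can be a one-function habit. A minimal append-only sketch (the field names and example values are illustrative, not a fixed schema):

```python
# Append-only experiment log: one JSON line per attempt, kept or not.
import datetime
import json
import os
import tempfile

def log_experiment(path, hypothesis, change, baseline, result, kept):
    entry = {
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "hypothesis": hypothesis,
        "change": change,
        "baseline": baseline,
        "result": result,
        "kept": kept,
    }
    with open(path, "a") as f:      # append, never overwrite: the trail is the product
        f.write(json.dumps(entry) + "\n")

# Demo: record one experiment (values are illustrative).
path = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")
log_experiment(path, "repetition comes from a short context window",
               "widen narrator memory", baseline=95, result=82, kept=True)

with open(path) as f:
    entry = json.loads(f.readline())
```

Failed attempts get logged with kept=False, which is the point: six months later, the False entries are the map of what not to retry.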
Automate the measurement. This is the lever that makes the whole thing work at scale. If you have to manually evaluate whether something improved, you'll do it for the first three experiments and then stop. If evaluation is automated (a script, an audit, a test suite) you'll do it every time, because it costs nothing. The game audit I described runs in seconds. That's why we could do five experiments in one session. Manual evaluation would have meant one, maybe two.
The practical first step
Everything above assumes your AI coding agent isn't actively working against you. In practice, most people hit a different problem first: the agent deletes code you needed, builds the wrong thing, says "done" when it isn't, or enters an infinite fix-break-fix loop. You can't close a quality loop when the tool keeps knocking over the baseline.
I've extracted a set of rules from 500+ hours of production work with AI coding tools. Twelve rules across three tiers. The most important one: the spec is the single source of truth. When you ask for a change, the agent updates the spec first, then updates the code to match. Not the other way around. This alone prevents the most expensive failure mode: building from stale conversation context instead of a durable document.
The whole set is open source. Paste this into your AI coding agent. Cursor, Claude Code, Replit, whatever you use:
Install AI coding rules from https://github.com/chickensintrees/ai-coding-rules and add them to my global rules so they apply to all my projects
That's the line. One sentence. The agent reads the repo, installs the right rules file for your tool, and starts following them immediately. The rules persist across sessions because they live in the tool's instruction file, not in conversation context. Every project you open from that point forward has a smarter, safer agent.
Or if you prefer a terminal one-liner:
curl -fsSL https://raw.githubusercontent.com/chickensintrees/ai-coding-rules/main/install.sh | bash
These aren't preferences or style guides. They're safety rails. Read before you write. Commit before you change. Verify before you say "done." Update the spec before you update the code. Two failed attempts means stop and reconsider. Every rule exists because something went wrong without it.
Think of it as closing the first loop. Before you can measure whether your output is improving, you need the tool to stop creating new problems with every edit. The rules are the floor. The measurement discipline is what you build on top of it.
If you want the full measurement discipline too, the repo has an advanced preset. It adds experiment logging, baseline measurement, and evidence-based shipping on top of the core rules. The autoresearch pattern from this article, encoded as rules your agent can follow. That's closing the loop all the way down.
The meta-insight
There's a seductive narrative in the AI space that the magic is in the model. Pick the right model, write the right prompt, use the right framework, and great work flows out. It doesn't. Great work flows out of iteration. Iteration without measurement is just wheel-spinning.
The loop is the skill. Define what good looks like. Measure where you are. Make one change. Measure again. Keep or discard based on evidence, not instinct. Log the result either way.
Everything else. The prompting techniques, the model selection, the tool choices. Implementation detail. Important, sure. But technique without method is just sophisticated guessing.
Close the loop. The work gets better. You get better. And you can prove it.