The "Make it an A" Loop
The prompt is seven sentences long and lives in a JSON file on my server:
A draft PR has been opened.
Is this done as Rails would do it? What letter grade would you give this PR? What could be improved? Is everything implemented and wired in and integrated? Do not commit stubs. Continue improving until it is an A. Mark the PR as ready when it is an A.
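On disk it's a plain JSON entry. The keys here are illustrative, not my exact file; the prompt text is verbatim:

```json
{
  "event": "pr.draft_opened",
  "prompt": "A draft PR has been opened. Is this done as Rails would do it? What letter grade would you give this PR? What could be improved? Is everything implemented and wired in and integrated? Do not commit stubs. Continue improving until it is an A. Mark the PR as ready when it is an A."
}
```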
That's it. When a draft PR opens, my hook-to-agent relay fires this at whichever
tmux pane owns the worktree, and the agent grades its own work before anyone
else looks at it.
Why a Grade
The interesting word is "grade." I tried other framings first: "review this PR," "look for issues," "what's missing." They all produced earnest-sounding responses that mostly just described what the PR did. Asking for a letter grade changes the output. The agent has to pick a letter, and picking a letter requires a comparison to some implicit ideal. Once there's a grade, there's a gap, and once there's a gap, there's something to fix.
"A" is the only acceptable landing. "Continue improving until it is an A" turns the self-review into a loop instead of a report. The agent doesn't stop at "B+, mostly good." It keeps going until its own judgment says the work is done.
Sentence by Sentence
Every clause in that prompt is there because the agent did something specific that I didn't want it to do again.
"Is this done as rails would do?" This one targets shape. The agent will happily produce a working function where Rails has a mixin, or a top-level export where Rails has a class. Tests pass, structure is wrong. Naming the reference up front makes the agent check itself against the source project instead of against its own test file.
"What letter grade would you give this PR?" The forcing function. Without a grade, the agent writes a paragraph that describes what it did and calls it a review. With a grade, it has to commit to a judgment, which requires a comparison, which surfaces gaps.
"What could be improved?" This is the catch-all for things the agent knew
were weak when it pushed. eslint-disable comments, TODOs, console.logs it
forgot to remove, a variable name it wasn't happy with. Asked directly, the
agent usually lists them unprompted. It just doesn't volunteer them.
"Is everything implemented and wired in and integrated?" Three words, three different failure modes. "Implemented" catches empty function bodies. "Wired in" catches methods that exist but aren't exported, or exports that aren't imported anywhere. "Integrated" catches the case where a module works in isolation but its caller was never updated to use it. The agent separates these in its answer. I've seen responses like "implemented yes, wired in yes, integrated no: the relation layer doesn't call it yet."
"Do not commit stubs." A direct prohibition. The agent interprets "improvement" very broadly, and without this line it will sometimes propose "improvements" that add more stubs in the name of scaffolding. This closes that door.
"Continue improving until it is an A." Turns it into a loop. Without this, the agent writes a self-review and stops. With this, the self-review becomes the start of another round of work.
"Mark PR as ready when it is an A." The exit condition. Otherwise the agent will loop forever, or stop in the middle and wait for me. The "ready" mark is also the signal to the rest of my automation that Copilot should take over.
Two Touchpoints
The grade prompt fires twice in a PR's life. The first time is when the draft opens, providing self-feedback before Copilot or I see anything. The agent has just finished a push and is feeling pretty good about itself. Asking for a grade at that moment reliably produces a list of things it knew it was cutting corners on but didn't mention in the commit message.
The second fires right before merge, after several rounds of Copilot comments and fixes. That one matters more than I expected. Long PR threads drift. The agent addresses comment after comment, fixing specifics, and by the end the code is a patchwork of responses to individual suggestions. The original intent can get buried. A final "what grade would you give this?" forces a fresh look at the whole diff instead of the last comment. Sometimes it finds that an earlier fix conflicted with a later one. Sometimes it notices that the file it was asked to update got updated, but the file that calls it didn't.
Why It Works
Self-grading only works if the grader has a reference. "Rails fidelity" is my reference: there's a specific codebase the port is supposed to match, and the agent can read it. Without that, the grade is just vibes. I've tried the same prompt on projects without a clear reference and the agent gives itself an A for effort.
The automation matters too. When I asked for self-review manually, I only did it when I already suspected something was off, which meant the prompt only fired after I'd done the work of noticing. Firing it on every draft PR, as part of the hook-to-agent loop, means the agent does the noticing for me. Most of the time it finds nothing interesting. Occasionally it finds a stub I would have missed or a lint disable I would have approved. Chasing completion percentages with empty stubs rather than fidelity with real implementations: that's the pattern it catches, and over hundreds of PRs, it adds up.
It's a seven-sentence prompt. I paste it into every project now.