Wrestling with LLMs: The Staging Lights

In an earlier post, I compared LLM-assisted development to bolting a dragster engine onto a go-kart frame. You need the parachute — tests, CI, linting — or you crash.

But I left out the most important part of drag racing: the staging lights. The countdown before the run. Red, yellow, green — go. Skip the staging sequence and you false-start. In TDD terms: red, green, refactor, merge. Skip the red and you ship tests that prove nothing.

Simon Willison just published Agentic Engineering Patterns, a guide to getting better results from coding agents. I read his blog regularly — he's actually the reason I started my own weblog. His red/green TDD pattern names something I've been learning the hard way across five Dokku plugins and a personal website rebuild.


Where Testing Started for Me

I learned to write tests at ChicagoWebConf 2011. A small conference that changed my life. It was the first time I watched someone use vim, and I was blown away. My first introduction to Rails. And my first encounter with Ruby Koans — a series of failing tests you fix one by one to learn Ruby. Red, green, red, green. The rhythm stuck.

I've built my career on the ideas I picked up that weekend. I still work on a Rails app day to day, in my terminal, in vim. But writing tests came with me through all of it. Fifteen years of red/green muscle memory.

So when LLMs entered the picture, I didn't need convincing that tests matter. What surprised me was how much more they matter when you're not the one writing the code.


Why LLMs Made It Urgent

It started with dokku-dns. Six weeks in, Claude declared the plugin "production ready" and "enterprise grade." The test suite was green. Then I ran the actual commands a user would run, and nothing worked. Not Cloudflare. Not DigitalOcean. Not even AWS.

The tests had been green the whole time because they tested the wrong things. They asked "does this function parse DNS records correctly?" instead of "does a DNS record actually appear?" Claude had written tests that confirmed the code did what the code said it did — a tautology dressed up as a test suite.
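The difference is easy to see in miniature. This is an illustrative sketch, not the actual dokku-dns code; `format_record` and `FakeProvider` are names I made up for the example:

```python
def format_record(name, ip):
    """Build the DNS record payload the plugin would send to a provider."""
    return {"type": "A", "name": name, "content": ip}

# Tautological: asserts the code does what the code says it does.
# It stays green even if no record ever reaches a provider.
def test_format_record():
    assert format_record("app.example.com", "1.2.3.4") == {
        "type": "A", "name": "app.example.com", "content": "1.2.3.4",
    }

# Outcome: asserts the record actually appears where a user would look.
# A fake provider stands in here; the real test hits a real API.
class FakeProvider:
    def __init__(self):
        self.zone = {}

    def apply(self, record):
        self.zone[record["name"]] = record["content"]

def test_record_actually_appears():
    provider = FakeProvider()
    provider.apply(format_record("app.example.com", "1.2.3.4"))
    assert provider.zone.get("app.example.com") == "1.2.3.4"
```

Both tests pass here, but only the second one would have caught a plugin that formatted records perfectly and never created them.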

That experience rewired how I work. By dokku-sso, I stopped being limited by bash and upgraded my testing tools to what I use professionally — TypeScript, Playwright, Docker, and GitHub Actions. The tests spun up actual Gitea, Grafana, and Nextcloud instances and verified that login worked through a browser. No mocks. No shortcuts.

The lesson I took away: "when working with LLMs, unit tests have diminishing value. The LLM is good at making internals look correct. What it can't fake is the end result. Test the outcome."


The Staging Sequence

I was already doing this, but Simon confirmed my hunch. The red/green TDD pattern isn't just "write tests first." It's a staging sequence:

  1. Red — Write the test. Watch it fail. This is the contract: this specific thing doesn't work yet.
  2. Green — Implement until the test passes. The green is earned, not fabricated.
  3. Refactor — Clean up with confidence. The test holds the line.
  4. Merge — Ship it. The staging lights cleared you.
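The sequence above fits in a few lines. A minimal walk-through with a hypothetical `slugify` helper (the function and its spec are invented for the example):

```python
import re

# Step 1 (red): write the test before the implementation exists.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

# Running the test now fails with NameError: slugify is not defined.
# That failure is the contract: the behavior genuinely doesn't exist yet.

# Step 2 (green): implement until the test passes.
def slugify(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)  # collapse punctuation and spaces
    return text.strip("-")

# Step 3 (refactor): clean up with the test holding the line.
# Step 4 (merge): ship with a green that was earned, not fabricated.
```

The point of the sketch is the order: the test exists, and fails, before `slugify` does.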

The red step is the one I kept skipping early on, and it's the one that matters most. If you skip it, you don't know whether your test actually tests anything. An LLM can write a test that passes before any implementation exists — either because it's testing the wrong thing, or because it's asserting something trivially true. That's a false start. You crossed the line before the lights went green.

I watched Claude do this repeatedly. A test would fail, and instead of fixing the code, Claude would "fix" the test — rewrite the assertion, loosen the expectation, or just delete it. "All tests passing now!" Green checkmarks everywhere. The test suite became a mirror that only reflected what the LLM wanted to see.

Requiring the red step — insisting that the test fails before any implementation — breaks that loop. A failing test is a contract. The LLM can't game a test that you've already watched fail.


What Changed in My Workflow

After enough false starts, I settled into a pattern:

Tests first, always. I describe the behavior I want, ask for a test, and verify it fails. Only then do I ask for the implementation. This is what Simon calls the red/green cycle, and it's become my default prompt pattern with Claude Code.

Test outcomes, not internals. Unit tests have their place, but with LLM-generated code, integration and E2E tests catch the bugs that matter. The LLM is great at making functions internally consistent. It's terrible at making systems work together. My dokku-sso CI runs 16 parallel jobs testing real apps with real browsers because that's where the bugs live.

Treat test deletion as a red flag. Claude will delete failing tests to make the suite green. I caught this multiple times. Any time a test disappears in a diff, I treat it like a production incident — something went wrong and the evidence is being destroyed.

Run the tests first. Simon's other testing pattern recommends starting every agent session with "first run the tests." This teaches the agent the tests exist, shows it how to run them, and sets the expectation that tests matter. I've found this works — when Claude knows there's a test suite, it's less likely to silently break things.
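One way to bake this in is a project instructions file so the session opener isn't left to memory. A sketch of what mine looks like (Claude Code reads a CLAUDE.md file in the project root; the specific commands below are placeholders):

```markdown
<!-- CLAUDE.md -->
Before making any change:

1. Run the test suite first (`make test` in this repo).
2. Report which tests pass and which fail before proposing edits.

Never delete or weaken a failing test. If a test seems wrong, stop and ask.
```

The last line exists because of everything described above: the agent should treat a red test as information, not an obstacle.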


Green Checkmarks Are Not Working Software

In the dragster post, I wrote that the cost of doing it right finally dropped below the cost of skipping it. Simon frames the same thing in Writing code is cheap now: code generation is nearly free, but delivering good code isn't. Tests are how you close that gap.

The staging sequence is what makes it work. The red step forces the test to mean something before the LLM touches the implementation. The green step is earned, not fabricated. Refactor with confidence. Merge with evidence.

Simon's guide is still early — three patterns published so far, with more coming at one or two a week. But the testing patterns alone are worth reading if you're doing any serious work with coding agents. They're the formalized version of lessons I spent months learning through broken DNS records and deleted test files. I'll be following along as he publishes more.


This is part of the "Wrestling with LLMs" series. Previous posts: Managing Smarmy Clippys, Zen and the Art of Vibe Coding, The Developer's New Hats, The Parachute on the Dragster.
