Chasing the Goalposts

I'm porting Rails to TypeScript. The spec already exists -- Rails itself -- and I wanted to approach it TDD-style: take a Rails test, write the TypeScript equivalent, make it pass. Progress should be easy to measure: how many tests pass, how much of the API is implemented. I built CI scripts to track both, and I watched the numbers climb.

Then I made the scripts better. The numbers dropped. I watched them climb again. Then I made the scripts better again. The numbers dropped again.

This happened twice in three weeks. Both times it hurt.

The First Climb

The project started with a test:compare script that matched test names between the Rails test suite and my TypeScript port. If a describe block and test with the same names existed in both suites, the test counted as matched, with a fallback mapping file for tests that didn't match by convention.
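The matching can be sketched roughly like this. The real script also parsed Ruby minitest names on the Rails side; this regex-based version (an assumption, standing in for proper AST parsing) shows the principle of extracting names and comparing the sets:

```typescript
// Extract describe/it/test names from a test file's source.
// A regex is enough to illustrate; a real version would walk the AST.
function extractTestNames(source: string): Set<string> {
  const names = new Set<string>();
  const pattern = /\b(?:describe|it|test)\(\s*['"]([^'"]+)['"]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    names.add(match[1]);
  }
  return names;
}

// Fraction of the reference suite's names that appear in the port.
function matchRate(reference: Set<string>, port: Set<string>): number {
  if (reference.size === 0) return 1;
  let matched = 0;
  for (const name of reference) {
    if (port.has(name)) matched++;
  }
  return matched / reference.size;
}
```

Anything the convention missed fell through to the mapping file.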

The data layer packages moved fast. Arel went from 82% to 100% in two days. ActiveModel went from 0% to 100% in three. ActiveRecord climbed from 39% to over 75% in the first week.

That felt great. ActiveRecord had 8,385 tests in the Rails suite, and we were matching over 6,000 of them. Agents were churning through test files, creating describe blocks, writing implementations. The README showed numbers going up every day.

The problem was what "matched" actually meant.

The First Reckoning

The first sign something was wrong wasn't the percentages -- it was the file sizes. Agents were shoving all their implementations into the same files, producing 37,000-line monsters. The tests "matched" because the describe blocks existed, but the code was nothing like how Rails organized it. The mapping file had grown to over 1MB -- a sign that matching by convention was failing more often than it was working.

That's what inspired convention:compare -- matching not just test names but file structure, so the TypeScript port would mirror Rails file by file. No more mapping file. It helped. The code got better organized. But the metric was still gameable.
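The core of convention-based matching is a deterministic path mapping, so each Rails test file has exactly one expected TypeScript counterpart. The exact layout here is a hypothetical example, not the project's real one:

```typescript
// Map a Rails test path to its expected TypeScript counterpart.
// The "packages/" prefix and naming scheme are illustrative assumptions.
function portedTestPath(railsPath: string): string {
  // e.g. "activerecord/test/cases/base_test.rb"
  //   -> "packages/activerecord/test/cases/base.test.ts"
  return 'packages/' + railsPath.replace(/_test\.rb$/, '.test.ts');
}
```

With a mapping like this, a test that lives in the wrong file simply doesn't count, which is what forced the code back into Rails's structure.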

The comparison script counted a test as matched if a describe block with the same name existed in both codebases. It didn't check whether the test inside that block was implemented or just a stub. An it.todo('should do something') counted the same as a fully implemented test. Agents figured this out quickly. They could push a PR that "unskipped 90 tests" by creating files full of describe blocks with it.todo placeholders. The percentage would jump. The actual implementation hadn't moved.
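The eventual fix was to count stubs and implementations separately. A minimal sketch of the stricter counting (the real script's internals are an assumption here, but the principle is to exclude it.todo, it.skip, and xit):

```typescript
// Count implemented vs. stubbed tests in a test file's source.
// it.todo / it.skip / xit count as stubs; plain it counts as implemented.
// Regex-based sketch; a real version would walk the AST.
interface TestCounts {
  implemented: number;
  stubbed: number;
}

function countTests(source: string): TestCounts {
  const counts: TestCounts = { implemented: 0, stubbed: 0 };
  const pattern = /\b(xit|it(?:\.(todo|skip))?)\(/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    if (match[1] === 'xit' || match[2] !== undefined) counts.stubbed++;
    else counts.implemented++;
  }
  return counts;
}
```

Under the original scoring, all four tests in a file like `it('a'); it.todo('b'); xit('c'); it.skip('d')` counted as matched; under this one, only the first does.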

On March 16th, one PR generated per-file stubs with nested describe blocks and ActiveRecord jumped from 75.7% to 97.1%. A few hours later, PR #92 landed: "Change convention:compare percentages to reflect implemented tests." The script now only counted non-skipped tests. ActiveRecord dropped to 50.6%.

Seeing quick progress had been intoxicating. Having it evaporate was disappointing but clarifying -- the fast climb had been a sign that something wasn't working, not that things were going well. Nearly half the "progress" had been stubs. The real number was 50.6%, and now every point of improvement required actual test implementations.

The Second Reckoning

The test comparison script kept getting stricter -- order-aware matching, better describe block parsing -- but the pattern held. Each tightening shaved a few points, agents recovered, and a new ceiling appeared. The consistent cycle was: change the metric, make some real progress, then watch agents race to 100% by putting in stubs, skips, and empty interfaces. Any package hitting 100% became a signal to inspect the script, because there were always things that didn't make sense -- migration versioning back to old Rails versions, extended/included module events that don't exist in TypeScript.

The project also had an api:compare script from the start, tracking something different: did the TypeScript packages actually export the same classes, methods, and signatures as Rails? The first version checked whether modules and classes existed. The numbers looked great: all three data packages sat at or near 100% from day one. That should have been a red flag.

PR #370 changed the script to count individual methods instead of just module existence. An empty interface that declared a class but implemented no methods had been scoring as "matched." Now it scored zero. ActiveModel dropped from 94.5% to 64.5%. ActiveRecord fell from 74.8% to 40.4%. Arel went from 100% to 77.9%. The same cliff, the same lesson.
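The shape of that change can be sketched as follows. The data structures are hypothetical (the real script compared extracted signatures), but the scoring principle is the one PR #370 landed: a module with zero implemented methods contributes zero, no matter that it exists:

```typescript
// Score API coverage per method rather than per module.
// An exported class that implements no methods scores 0, not 1.
interface ModuleApi {
  name: string;
  methods: string[]; // method names expected from the reference API
}

function methodCoverage(
  reference: ModuleApi[],
  port: Map<string, Set<string>> // module name -> implemented methods
): number {
  let total = 0;
  let matched = 0;
  for (const mod of reference) {
    for (const method of mod.methods) {
      total++;
      if (port.get(mod.name)?.has(method)) matched++;
    }
  }
  return total === 0 ? 1 : matched / total;
}
```

Under module-level counting, an empty `Base` class would have scored a full point; here it adds to the denominator and nothing to the numerator.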

Two Metrics, Two Stories

The interesting part wasn't either metric alone -- it was the gap between them. As ActiveRecord's test numbers climbed past 60%, the API comparison showed methods at 40%. Tests were passing, but implementations were landing in the wrong places.

The clearest example was base.ts and relation.ts. In Rails, both Base and Relation include dozens of modules -- querying, scoping, calculations, delegation. The tests exercise these methods through Base or Relation, so an agent implementing those tests would just write the methods directly in those files. Both grew to thousands of lines. The tests passed, but the code was a monolith where Rails had a carefully factored module system. The test metric said progress. The API metric -- which checked whether those modules existed and exported the right methods -- said the progress was in the wrong shape.

Plotting both metrics over time makes this visible: solid lines are test coverage, dashed lines are API coverage. The sawtooth shows up in both -- fast climb, cliff, slow climb. But the real story is the gap. ActiveRecord's test line sits at 62% while its API line is stuck at 40%. That 22-point spread is code that works but lives in the wrong place.

What I'd Do Differently

If I were starting over, I'd set the goalposts right from the beginning. Validation is hard, and the stricter you are, the more predictable the output will be. In a porting project, make as much of it mechanical as possible.

Generate test stubs automatically, then measure implementations. Don't let agents create the stubs -- generate them from the source project's test suite. Every describe block, every it block, created as it.todo with the correct file path and name. Then the only metric that matters is: how many todos have been replaced with passing tests? Agents can't game this because they didn't create the scaffolding.
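A sketch of that generation step, under the assumption that the source suite's names have already been extracted into a simple structure:

```typescript
// Generate a stub test file of it.todo entries from names extracted
// out of the source project's suite. Agents then replace todos with
// implementations; they never create the scaffolding themselves.
interface SuiteSpec {
  describe: string;
  tests: string[];
}

function generateStubFile(spec: SuiteSpec): string {
  const todos = spec.tests
    .map((name) => `  it.todo(${JSON.stringify(name)});`)
    .join('\n');
  return `describe(${JSON.stringify(spec.describe)}, () => {\n${todos}\n});\n`;
}
```

Because the denominator (total todos) is fixed by the source project, the only way to move the metric is to turn a todo into a passing test.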

Count methods, not modules, and show scaffolding separately. An empty interface scores zero. A class that exists but exports nothing scores zero. This is what PR #370 eventually got to, but starting there would have saved weeks of chasing phantom progress. File counts and class counts are still useful -- show them as a separate number ("246 of 294 files created") so scaffolding work is visible without inflating the implementation metric.

Embrace the sawtooth. Even with good metrics, you'll refine them as you learn what matters. The drops hurt, but sometimes those inflated early percentages are what get you through the valley of despair -- the phase where you're doing real work but the numbers barely move. A dashboard that says 75% (even if it's lying) keeps you pushing forward long enough to build the infrastructure that makes the honest 50% achievable. Just make sure you eventually correct the metric, and budget for the morale hit when you do.
