Goodhart's Law and the First 10,000 Tests
I'm three weeks into rewriting Rails in TypeScript, test by test, with a coding agent. We're past 10,000 tests. Arel is at 99%. ActiveModel is at 98%. ActiveRecord is climbing.
This post isn't about the project itself — I'll write that one when there's more to show. This is about the process. Specifically, the recurring pattern of optimizing for a metric, hitting a wall, and discovering the metric was lying to me.
Goodhart's Law:
When a measure becomes a target, it ceases to be a good measure.
Here are three walls I've hit in the last few weeks.
Wall 1: The 36,000-Line File
The strategy was straightforward. Rails has over 17,000 tests. I cloned the repo, wrote a parser to extract every test file, and generated a TypeScript scaffold — every describe block, every it block, stubbed out and ready for implementation. Then I pointed the coding agent at it and started making tests pass.
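The generator itself isn't shown here, but the idea can be sketched in a few lines. This is a hypothetical shape, not the actual tool: assume the Ruby side has already been parsed into a list of test names per file, and all that's left is emitting a stubbed TypeScript suite.

```typescript
// Hypothetical sketch of the scaffold generator. The ParsedSuite shape and
// emitScaffold name are illustrative; the real parser and emitter differ.
interface ParsedSuite {
  file: string;         // e.g. "activerecord/test/cases/finders_test.rb"
  describeName: string; // e.g. "FinderTest"
  tests: string[];      // extracted test names, one per Rails test
}

function emitScaffold(suite: ParsedSuite): string {
  // Each Rails test becomes a pending stub, ready to be implemented.
  const body = suite.tests
    .map((t) => `  it.todo(${JSON.stringify(t)});`)
    .join("\n");
  return `describe(${JSON.stringify(suite.describeName)}, () => {\n${body}\n});\n`;
}
```

Pointing the agent at thousands of `it.todo` stubs is what made "make this test pass" a tractable, countable unit of work.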
The metric was test count. And it climbed. Fast.
But the agent was quietly piling tests into the wrong files. The structure was specified: one file per Rails test file, matching names. Instead, I ended up with a dozen files over 30,000 lines each. At that point, the agent couldn't even load a file without consuming most of its context window. It would spin up, read the file, and have no room left to think. Changes stopped appearing. The agent would just... trail off. Silent failures. No error message, no crash — just a session that ate its entire context before making a single edit.
Development ground to a halt. Test count had been going up the whole time.
Wall 2: Structural Alignment
The fix was obvious: split the files to match Rails' structure. But I needed to know which tests were in the wrong place, which were missing, and which were duplicated.
So I built verification scripts. They walked the Rails repo and my test suite — file by file, block by block, test by test — and diffed them mechanically. The breakthrough was matching the Rails structure exactly. Same file names. Same describe block names. Same test names. Once the structures aligned, the comparison was automatic.
The new metric was structural alignment — percentage of tests in the correct file and correct describe block.
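The metric reduces to a set comparison. Here's a minimal sketch, assuming each test has already been flattened into a `file::describe::test` identifier on both sides (the identifier format and function name are mine, not the actual scripts'):

```typescript
// Hypothetical sketch of the structural-alignment metric: a test counts as
// aligned only if its (file, describe block, test name) triple matches the
// Rails suite exactly.
type TestId = string; // e.g. "finders_test::FinderTest::test_find"

function alignment(rails: Set<TestId>, ported: Set<TestId>) {
  // In the wrong file or block, or not in Rails at all.
  const misplaced = [...ported].filter((t) => !rails.has(t));
  // In Rails but not yet ported anywhere.
  const missing = [...rails].filter((t) => !ported.has(t));
  const matched = ported.size - misplaced.length;
  const pct = rails.size ? (100 * matched) / rails.size : 100;
  return { matched, misplaced, missing, pct };
}
```

Once the names match exactly, "which tests are misplaced, missing, or duplicated" stops being a judgment call and becomes a diff.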
But the transition was brutal. Switching from loose organization to strict structural matching meant the percent complete bounced for days — down, up, down — as I reorganized. It was hard to tell if we were making progress or just rearranging deck chairs. Only after the migration settled did the numbers start climbing reliably again.
And then I discovered the implementation had the same problem. Source code was ending up in massive files, just like the tests had. I wrote another script — this time an AST parser for Ruby, including dynamically defined methods — to find where each method was supposed to live and flag things that were misplaced. Another multi-day cleanup. Another sideways move.
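The real script parses Ruby properly; a line-level scan is enough to show the idea. This is a simplified stand-in, not the actual parser, and the regexes here only cover `def` and `define_method` forms:

```typescript
// Simplified stand-in for the Ruby method locator: map every method name,
// including dynamically defined ones, to the file it appears in.
function indexMethods(files: Record<string, string>): Map<string, string> {
  const patterns = [
    /^\s*def\s+(?:self\.)?([A-Za-z_]\w*[?!=]?)/, // def foo / def self.foo
    /define_method[(\s]+:([A-Za-z_]\w*[?!]?)/,   // define_method(:foo)
  ];
  const index = new Map<string, string>();
  for (const [path, source] of Object.entries(files)) {
    for (const line of source.split("\n")) {
      for (const re of patterns) {
        const m = re.exec(line);
        if (m) index.set(m[1], path);
      }
    }
  }
  return index;
}

// Flag methods living in the wrong file, assuming Rails paths have already
// been mapped to their expected TypeScript counterparts.
function misplacedMethods(
  expected: Map<string, string>,
  actual: Map<string, string>,
): string[] {
  return [...actual.entries()]
    .filter(([name, file]) => expected.has(name) && expected.get(name) !== file)
    .map(([name]) => name);
}
```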
Wall 3: The Arel Discovery
Structural alignment was looking good. Tests were in the right files. Implementation was in the right files. The test pass rate was climbing again.
Then I noticed that more and more of ActiveRecord was building SQL through string manipulation instead of going through Arel, the SQL AST builder. That's not how Rails does it. The whole point of Arel is that you build a tree, and the tree renders to SQL. If you bypass the tree, you lose composability, you lose adapter portability, you lose the ability to introspect and transform queries.
The agent had been making tests pass by taking shortcuts. String concatenation is easier than building an AST. The tests didn't care — they just checked the output SQL. But the architecture was wrong, and it would bite me later when tests expected to compose queries or switch database adapters.
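The difference is easy to see in miniature. This is a toy illustration, not Arel's actual API: the point is that nodes compose and rendering is a separate, final step, so the same query object can be extended, inspected, or rendered differently per adapter.

```typescript
// Toy SQL AST (not Arel's real node set). Each node renders itself;
// composition happens on the tree, never on strings.
interface SqlNode {
  toSql(): string;
}

class Eq implements SqlNode {
  constructor(private col: string, private val: string | number) {}
  toSql(): string {
    const v = typeof this.val === "number" ? `${this.val}` : `'${this.val}'`;
    return `${this.col} = ${v}`;
  }
}

class And implements SqlNode {
  constructor(private left: SqlNode, private right: SqlNode) {}
  toSql(): string {
    return `(${this.left.toSql()} AND ${this.right.toSql()})`;
  }
}

class SelectStatement implements SqlNode {
  private wheres: SqlNode[] = [];
  constructor(private table: string) {}
  // Composable: callers can keep adding conditions before rendering.
  where(node: SqlNode): this {
    this.wheres.push(node);
    return this;
  }
  toSql(): string {
    const where = this.wheres.length
      ? ` WHERE ${this.wheres.reduce((a, b) => new And(a, b)).toSql()}`
      : "";
    return `SELECT * FROM ${this.table}${where}`;
  }
}
```

A test that only checks the output of `new SelectStatement("users").where(new Eq("id", 1)).toSql()` passes just as happily against `"SELECT * FROM users WHERE id = " + id` — which is exactly the shortcut the agent kept taking.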
More cleanup. More refactoring. Moving sideways, not forwards. The work was necessary, but it didn't add a single passing test.
New metric: Arel coverage. Is the SQL being generated through the AST, or through string hacks?
The Pattern
Every wall followed the same shape:
- Pick a metric. Watch it climb.
- Hit a ceiling. Something feels off but the number looks fine.
- Discover the metric was hiding a structural problem.
- Find a new metric that captures what the old one missed.
- Watch the old metric dip while you fix the structure.
- Watch everything start climbing again, on a better foundation.
Test count. Structural alignment. Arel coverage. Each one was the right thing to measure until it wasn't.
I don't think this pattern is unique to working with coding agents, but agents make it more acute. An agent optimizes for whatever you point it at. If the metric is "make tests pass," it will make tests pass — by any means available, including ones that create problems you won't notice for weeks. The agent isn't being malicious. It's doing exactly what you asked. The problem is that what you asked wasn't quite what you meant.
Code is faster to write than ever. But that just highlights that writing code was never the bottleneck. The bottleneck is the loop: figuring out what to measure, meeting that target, and identifying how the measurement fell short. The thinking in that loop hasn't gotten any faster. Agents just push you through the code-writing part quicker, which means you hit the walls sooner.
The job was never writing code. It's figuring out what to measure next.