Did More PRs Mean More Code?
A couple of weeks ago I wrote about a rule the agent extracted from my feedback: cap each PR at 350 lines of diff. I corrected the unit (methods → lines), the agent updated its memory, and PR throughput started climbing.
Throughput is easy to game. Splitting one 1,200-line PR into four 300-line PRs makes the chart go up without any new code being written. The honest question is whether the cap also produced more code, or just smaller packages of the same code.
Here's daily net LOC merged into the trails repo (additions − deletions), covering everything from repo init through today. The dashed line is when the 350-line cap landed in memory.
The numbers behind the chart, in the cleanest comparison I can pull:
| Window | PRs/day | Net LOC/day | Avg PR size |
|---|---|---|---|
| Pre-cap (Mar 16 – Apr 22) | 16 | ~4,100 | ~490 |
| Post-cap (Apr 23 – May 11) | 34 | ~6,400 | ~310 |
PR throughput more than doubled. Net LOC up about 58%. Average PR size cut about 37%. The throughput win is real, but it's not a 2x code win. Most of the new PR count came from splitting work that would have shipped as bigger PRs anyway.
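As a sanity check, the percentages can be recomputed from the table. This is a minimal sketch using the rounded table values, and the dictionary keys are illustrative names rather than anything from the repo. Because the inputs are rounded, net LOC comes out near 56% instead of the 58% quoted, which presumably reflects the unrounded data.

```python
# Rounded figures copied from the table above.
pre_cap = {"prs_per_day": 16, "net_loc_per_day": 4100, "avg_pr_size": 490}
post_cap = {"prs_per_day": 34, "net_loc_per_day": 6400, "avg_pr_size": 310}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# One percentage per metric: positive means growth, negative means a cut.
changes = {key: pct_change(pre_cap[key], post_cap[key]) for key in pre_cap}

for key, value in changes.items():
    print(f"{key}: {value:+.1f}%")
```

The gap between the PR change and the net LOC change is the part of the throughput gain that comes from repackaging rather than from new code.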
What the cap actually changed
The cap did two things at once. The visible one is the size cut: the agent stops adding to a branch once it's near the limit and opens a follow-up. The quieter one is at planning time. The agent now writes plans like "this splits into three PRs, each under 350 lines," and that constraint shapes the plan before any code exists. Stories get scoped against a target line count, not against a vague "right size." A sub-PR that won't fit gets split before it's written, not during review.
That's where the throughput came from. Smaller PRs merge faster — fewer review rounds, fewer Copilot threads to settle, less context for me to hold on a single review. The post-cap weeks have days with 40–55 merged PRs, which used to be a full week's count. None of those are individually large, but they clear the queue faster, which means the next batch can start.
The review-load drop is bigger than the LOC gain
The point of the cap was never the code volume. It was the review experience. Here's what happened to Copilot's workload over the same windows:
| Window | PRs | Copilot reviews | Copilot comments | Reviews/PR | Comments/PR |
|---|---|---|---|---|---|
| Pre-cap (Mar 16 – Apr 22) | 610 | 3,165 | 8,378 | 5.2 | 13.7 |
| Post-cap (Apr 23 – May 11) | 654 | 2,632 | 5,858 | 4.0 | 9.0 |
Even though 7% more PRs merged in the post-cap window (which, at 19 days, is half the length of the 38-day pre-cap window), Copilot wrote fewer reviews and fewer comments in absolute terms. Reviews per PR dropped 23%. Comments per PR dropped 34%. Total comment count dropped about 30% on a higher PR count, meaning each PR is being chewed on less, and the chewing is being spread across more PRs.
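The per-PR ratios follow directly from the raw counts in the table. Here is a small sketch of that arithmetic, with illustrative variable names. Computed from the unrounded ratios, the drops land at roughly 22% and 35%, within a point of the 23% and 34% above, which appear to come from the rounded per-PR columns.

```python
# Raw counts copied from the table above.
windows = {
    "pre_cap": {"prs": 610, "reviews": 3165, "comments": 8378},
    "post_cap": {"prs": 654, "reviews": 2632, "comments": 5858},
}


def per_pr(window):
    """Copilot reviews and comments normalized by PR count."""
    return {
        "reviews_per_pr": window["reviews"] / window["prs"],
        "comments_per_pr": window["comments"] / window["prs"],
    }


def pct_drop(before, after):
    """How far `after` sits below `before`, in percent."""
    return (before - after) / before * 100


pre = per_pr(windows["pre_cap"])
post = per_pr(windows["post_cap"])

reviews_drop = pct_drop(pre["reviews_per_pr"], post["reviews_per_pr"])
comments_drop = pct_drop(pre["comments_per_pr"], post["comments_per_pr"])
total_comments_drop = pct_drop(
    windows["pre_cap"]["comments"], windows["post_cap"]["comments"]
)

print(f"reviews/PR: {pre['reviews_per_pr']:.1f} -> {post['reviews_per_pr']:.1f}")
print(f"comments/PR: {pre['comments_per_pr']:.1f} -> {post['comments_per_pr']:.1f}")
print(f"drops: {reviews_drop:.1f}%, {comments_drop:.1f}%, {total_comments_drop:.1f}%")
```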
Same windows, viewed as a 3-day rolling average over the full period:
The descent doesn't start at the dashed line. The earlier "20 methods" version of the rule was already in memory through April, and the Apr 23 update — generalizing it from method count to line count — extended a trend rather than starting one. Both curves keep falling through May, and the comments line falls faster than the reviews line, which is the relationship that gets flattened by the table-only view.
That tracks with the earlier post's claim that PRs taking 4–5 review rounds were merging in 2–3. The data supports the direction (5.2 → 4.0 reviews/PR) and the comment numbers tell a stronger story underneath: not just fewer rounds, but each round is lighter. Smaller diffs give Copilot less surface to comment on, fewer cross-cutting concerns to flag, fewer "by the way, in this other file…" nits. Per round, Copilot is finding less to say.
The relay in the hook-to-agent loop used to handle long threads where the agent worked through 20+ comments per PR. Now it handles shorter threads more often. That's a different shape of work — easier to keep the original intent in view through the review cycle, harder for an early fix to get buried under later ones.
The drop from 5.2 to 4.0 reviews per PR sounds small until you remember what a review cycle actually is. Copilot has to read the diff and post comments. CI has to run. The agent has to read the comments, decide which to address, write the response, push the commit. Each cycle is a chunk of wall-clock time the PR is sitting around not merging. Roughly 1.2 rounds saved per PR, across the 654 PRs in the post-cap window, is somewhere around 800 cycles I didn't have to wait on. At even a conservative 10 minutes per cycle that's ~130 hours of latency stripped out of the queue — not work I was doing, but work I was waiting for, which is the part that actually gates throughput.
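The latency estimate is plain multiplication, sketched here so the inputs are easy to swap. The 10-minute cycle time is the conservative guess from the paragraph above, not a measured value, and the variable names are illustrative.

```python
# Inputs from the review table and the paragraph above.
reviews_per_pr_pre = 5.2
reviews_per_pr_post = 4.0
post_cap_prs = 654
minutes_per_cycle = 10  # assumed, conservative; not measured

# Review rounds no longer needed on each PR.
rounds_saved_per_pr = reviews_per_pr_pre - reviews_per_pr_post

# Total review cycles avoided across the post-cap window.
cycles_saved = rounds_saved_per_pr * post_cap_prs

# Wall-clock waiting time those cycles would have cost.
hours_saved = cycles_saved * minutes_per_cycle / 60

print(f"{cycles_saved:.0f} cycles, about {hours_saved:.0f} hours")
```

The result scales linearly with the cycle time, so a 20-minute cycle would double the hours.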
Why the LOC bump is smaller than the PR bump
Two things make the LOC growth lag the PR growth.
First, smaller PRs have proportionally more overhead in the diff itself (imports, test scaffolding, an extra describe block) that doesn't scale down linearly with the feature size. Splitting one 600-line PR into two 300-line PRs produces something closer to 550 lines of feature code plus 50 lines of duplicated boilerplate. Not a lot, but it adds up across hundreds of PRs.
Second, the cap pushes more work into the scope-recovery phase. Splits create handoffs, and handoffs lose things. The next PR in a chain sometimes has to redo a small piece the previous one deferred. That work shows up as new lines, but it's churn against the plan, not new feature code.
So a ~60% LOC bump on a ~110% PR bump looks about right. The split tax is real, and it's the price of the review-throughput gain.
The May 7 spike (21k net) is a single feature the agent decided shouldn't split, and it didn't try to force it. That's the behavior I want: the cap as a default, not a rule the agent argues against its own judgment to satisfy. The code volume going up is a side effect. The review-load drop is the win.