Did More PRs Mean More Code?

A couple of weeks ago I wrote about a rule the agent extracted from my feedback: cap each PR at 350 lines of diff. I corrected the unit (methods → lines), the agent updated its memory, and PR throughput started climbing.

Throughput is easy to game. Splitting one 1,200-line PR into four 300-line PRs makes the chart go up without any new code being written. The honest question is whether the cap also produced more code, or just smaller packages of the same code.

Here's daily net LOC merged into the trails repo (additions − deletions), covering everything from repo init through today. The dashed line is when the 350-line cap landed in memory.

[Figure: net lines of code per day (additions − deletions), Mar 12 – May 11, 2026. Dashed line: 350-line PR cap.]

Date         Net LOC
2026-03-12   +41,518 (repo init)
2026-03-13   -306
2026-03-14   -9,898
2026-03-15   +2,362
2026-03-16   +4,357
2026-03-17   +5,189
2026-03-18   +6,610
2026-03-19   +2,544
2026-03-20   +4,601
2026-03-21   +3,986
2026-03-22   +2,769
2026-03-23   +3,651
2026-03-24   -11,488
2026-03-25   +1,703
2026-03-26   +3,465
2026-03-27   +2,815
2026-03-28   +3,921
2026-03-29   +1,957
2026-03-30   +8,011
2026-03-31   +15,088
2026-04-01   +9,927
2026-04-02   +4,145
2026-04-03   +594
2026-04-04   +2,061
2026-04-05   +3,211
2026-04-06   +5,283
2026-04-07   +9,611
2026-04-08   +951
2026-04-09   +3,530
2026-04-10   +4,733
2026-04-11   +678
2026-04-12   +173
2026-04-13   +1,620
2026-04-14   +6,878
2026-04-15   +7,777
2026-04-16   +7,071
2026-04-17   +4,907
2026-04-18   +6,262
2026-04-19   +4,510
2026-04-20   +6,228
2026-04-21   +2,361
2026-04-22   +3,139
2026-04-23   +10,932
2026-04-24   +13,273
2026-04-25   +5,396
2026-04-26   +12,047
2026-04-27   +10,192
2026-04-28   +4,010
2026-04-29   +8,867
2026-04-30   +5,791
2026-05-01   +1,564
2026-05-02   +4,642
2026-05-03   +2,805
2026-05-04   +6,480
2026-05-05   +3,414
2026-05-06   +6,086
2026-05-07   +21,383
2026-05-08   +3,650
2026-05-09   +900
2026-05-10   +207
2026-05-11   +751

The numbers behind the chart, in the cleanest comparison I can pull:

Window                       PRs/day   Net LOC/day   Avg PR size
Pre-cap (Mar 16 – Apr 22)    16        ~4,100        ~490
Post-cap (Apr 23 – May 11)   34        ~6,400        ~310

PR throughput more than doubled. Net LOC up about 58%. Average PR size cut about 37%. The throughput win is real, but it's not a 2x code win. Most of the new PR count came from splitting work that would have shipped as bigger PRs anyway.
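The window averages can be rechecked from the daily values in the chart. This is a minimal sketch under stated assumptions: the list is copied by hand from the chart labels for Mar 16 through May 11, and the names (`DAILY_NET_LOC`, `window_average`, `pct_change`) are made up for illustration, not taken from whatever tooling produced the table.

```python
from datetime import date, timedelta

# Daily net LOC copied from the chart labels, starting 2026-03-16
# (the first day of the pre-cap window). Hand-transcribed, so treat as approximate.
DAILY_NET_LOC = [
    4357, 5189, 6610, 2544, 4601, 3986, 2769, 3651, -11488, 1703,  # Mar 16-25
    3465, 2815, 3921, 1957, 8011, 15088,                           # Mar 26-31
    9927, 4145, 594, 2061, 3211, 5283, 9611, 951, 3530, 4733,      # Apr 01-10
    678, 173, 1620, 6878, 7777, 7071, 4907, 6262, 4510, 6228,      # Apr 11-20
    2361, 3139,                                                    # Apr 21-22
    10932, 13273, 5396, 12047, 10192, 4010, 8867, 5791,            # Apr 23-30
    1564, 4642, 2805, 6480, 3414, 6086, 21383, 3650, 900, 207, 751,  # May 01-11
]

START = date(2026, 3, 16)     # first day of the pre-cap window
CAP_DATE = date(2026, 4, 23)  # first day of the post-cap window
END = START + timedelta(days=len(DAILY_NET_LOC))  # day after the last data point


def window_average(start, end):
    """Mean daily net LOC for start <= day < end."""
    values = DAILY_NET_LOC[(start - START).days:(end - START).days]
    return sum(values) / len(values)


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


pre_loc = window_average(START, CAP_DATE)
post_loc = window_average(CAP_DATE, END)

print(f"Net LOC/day: {pre_loc:,.0f} -> {post_loc:,.0f} ({pct_change(pre_loc, post_loc):+.0f}%)")
print(f"PRs/day:     16 -> 34 ({pct_change(16, 34):+.0f}%)")     # values from the table
print(f"Avg PR size: 490 -> 310 ({pct_change(490, 310):+.0f}%)")  # values from the table
```

The two averages round to the ~4,100 and ~6,400 in the table, and the LOC change comes out near 58% against a PR change a little over 110%.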

What the cap actually changed

The cap did two things at once. The visible one is the size cut: the agent stops adding to a branch once it's near the limit and opens a follow-up. The quieter one is at planning time. The agent now writes plans like "this splits into three PRs, each under 350 lines," and that constraint shapes the plan before any code exists. Stories get scoped against a target line count, not against a vague "right size." A sub-PR that won't fit gets split before it's written, not during review.
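One way to picture the planning-time effect is as a packing step that runs before any code exists. The sketch below is an assumption about the shape of that step, not the agent's actual logic: `plan_prs`, the item names, and the line estimates are all hypothetical.

```python
CAP = 350  # lines of diff per PR, the cap described in this post


def plan_prs(items, cap=CAP):
    """Greedily group (name, estimated_lines) items, in order, into PRs under the cap.

    An item that exceeds the cap on its own is not forced apart here; it simply
    ends up alone in its PR, leaving the split-or-not decision to judgment.
    """
    prs, current, size = [], [], 0
    for name, lines in items:
        # Start a follow-up PR when adding this item would cross the cap.
        if current and size + lines > cap:
            prs.append(current)
            current, size = [], 0
        current.append(name)
        size += lines
    if current:
        prs.append(current)
    return prs


# Hypothetical story with per-item line estimates (illustration only).
story = [("schema", 120), ("api", 200), ("ui", 180), ("tests", 150)]
plan = plan_prs(story)
print(plan)  # [['schema', 'api'], ['ui', 'tests']]
```

The point of the sketch is only the ordering: the split happens against estimates, so a PR that won't fit is divided in the plan rather than during review.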

That's where the throughput came from. Smaller PRs merge faster — fewer review rounds, fewer Copilot threads to settle, less context for me to hold on a single review. The post-cap weeks have days with 40–55 merged PRs, which used to be a full week's count. None of those are individually large, but they clear the queue faster, which means the next batch can start.

The review-load drop is bigger than the LOC gain

The point of the cap was never the code volume. It was the review experience. Here's what happened to Copilot's workload over the same windows:

Window                       PRs   Copilot reviews   Copilot comments   Reviews/PR   Comments/PR
Pre-cap (Mar 16 – Apr 22)    610   3,165             8,378              5.2          13.7
Post-cap (Apr 23 – May 11)   654   2,632             5,858              4.0          9.0

Even though 7% more PRs merged in the post-cap window, Copilot wrote fewer reviews and fewer comments in absolute terms. Reviews per PR dropped 23%. Comments per PR dropped 34%. Total comment count dropped about 30% on a higher PR count — meaning each PR is being chewed on less, and the chewing is being spread across more PRs.
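The per-PR figures follow directly from the raw counts in the table. A minimal sketch of that arithmetic, with the dictionary layout and helper names (`per_pr`, `pct_change`) invented for illustration:

```python
# Raw counts from the table above.
windows = {
    "pre_cap":  {"prs": 610, "reviews": 3165, "comments": 8378},
    "post_cap": {"prs": 654, "reviews": 2632, "comments": 5858},
}


def per_pr(window, key):
    """Average count of `key` per merged PR in the given window."""
    return windows[window][key] / windows[window]["prs"]


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


reviews_per_pr = {w: per_pr(w, "reviews") for w in windows}
comments_per_pr = {w: per_pr(w, "comments") for w in windows}

pr_change = pct_change(windows["pre_cap"]["prs"], windows["post_cap"]["prs"])
total_comment_change = pct_change(
    windows["pre_cap"]["comments"], windows["post_cap"]["comments"]
)
reviews_change = pct_change(reviews_per_pr["pre_cap"], reviews_per_pr["post_cap"])
comments_change = pct_change(comments_per_pr["pre_cap"], comments_per_pr["post_cap"])

print(f"PRs merged:      {pr_change:+.1f}%")
print(f"Total comments:  {total_comment_change:+.1f}%")
print(f"Reviews per PR:  {reviews_change:+.1f}%")
print(f"Comments per PR: {comments_change:+.1f}%")
```

Computed from the unrounded counts, the per-PR drops land within a point of the quoted 23% and 34%; the small differences come from rounding the ratios to one decimal before taking the percentage.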

Same windows, viewed as a 3-day rolling average over the full period:

[Figure: Copilot reviews per PR (left axis, 0–12) and Copilot comments per PR (right axis, 0–30), 3-day rolling average, Mar 12 – May 11. Dashed line: 350-line PR cap.]
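For reference, a 3-day rolling per-PR series can be computed as below. This is a sketch under assumptions: the post does not say whether the chart averages daily ratios or pools counts across the window, so this version pools (sum of reviews over sum of PRs for the trailing three days). The function name and the sample numbers are hypothetical.

```python
from collections import deque


def rolling_ratio(numerators, denominators, window=3):
    """Trailing-window ratio: sum(numerators) / sum(denominators) over the last `window` days.

    Early days use however many days are available. A window with zero
    PRs yields None instead of dividing by zero.
    """
    num_win = deque(maxlen=window)  # deque drops the oldest day automatically
    den_win = deque(maxlen=window)
    out = []
    for n, d in zip(numerators, denominators):
        num_win.append(n)
        den_win.append(d)
        total = sum(den_win)
        out.append(sum(num_win) / total if total else None)
    return out


# Hypothetical daily counts, for illustration only (not from the repo).
reviews = [50, 80, 60, 40]
prs = [10, 20, 10, 10]
smoothed = rolling_ratio(reviews, prs)
print(smoothed)
```

Pooling weights busy days more heavily than quiet ones, which keeps a day with two merged PRs from swinging the line as much as a day with fifty.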

The descent doesn't start at the dashed line. The earlier "20 methods" version of the rule was already in memory through April, and the Apr 23 update — generalizing it from method count to line count — extended a trend rather than starting one. Both curves keep falling through May, and the comments line falls faster than the reviews line, which is the relationship that gets flattened by the table-only view.

That tracks with the earlier post's claim that PRs taking 4–5 review rounds were merging in 2–3. The data supports the direction (5.2 → 4.0 reviews/PR) and the comment numbers tell a stronger story underneath: not just fewer rounds, but each round is lighter. Smaller diffs give Copilot less surface to comment on, fewer cross-cutting concerns to flag, fewer "by the way, in this other file…" nits. Per round, Copilot is finding less to say.

The relay in the hook-to-agent loop used to handle long threads where the agent worked through 20+ comments per PR. Now it handles shorter threads more often. That's a different shape of work — easier to keep the original intent in view through the review cycle, harder for an early fix to get buried under later ones.

The drop from 5.2 to 4.0 reviews per PR sounds small until you remember what a review cycle actually is. Copilot has to read the diff and post comments. CI has to run. The agent has to read the comments, decide which to address, write the response, push the commit. Each cycle is a chunk of wall-clock time the PR is sitting around not merging. Roughly 1.2 rounds saved per PR, across the 654 PRs in the post-cap window, is somewhere around 800 cycles I didn't have to wait on. At even a conservative 10 minutes per cycle that's ~130 hours of latency stripped out of the queue — not work I was doing, but work I was waiting for, which is the part that actually gates throughput.
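The latency estimate is simple enough to write out. A minimal sketch; the constant names are made up for illustration, and the 10-minute cycle is the guess stated above, not a measurement.

```python
PRS_POST_CAP = 654               # PRs merged in the post-cap window (from the table)
ROUNDS_SAVED_PER_PR = 5.2 - 4.0  # reviews per PR, pre-cap minus post-cap
MINUTES_PER_CYCLE = 10           # assumed wall-clock cost of one review cycle

cycles_saved = PRS_POST_CAP * ROUNDS_SAVED_PER_PR
hours_saved = cycles_saved * MINUTES_PER_CYCLE / 60

print(f"Review cycles avoided: ~{cycles_saved:.0f}")
print(f"Queue latency avoided: ~{hours_saved:.0f} hours")
```

The result scales linearly with the cycle time, so a 20-minute cycle would double the hours while leaving the cycle count unchanged.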

Why the LOC bump is smaller than the PR bump

Two things make the LOC growth lag the PR growth.

First, smaller PRs have proportionally more overhead in the diff itself (imports, test scaffolding, an extra describe block) that doesn't scale down linearly with the feature size. Splitting one 600-line PR into two 300-line PRs produces something closer to 550 lines of feature code plus 50 lines of duplicated boilerplate. Not a lot, but it adds up across hundreds of PRs.

Second, the cap pushes more work into the scope-recovery phase. Splits create handoffs, and handoffs lose things. The next PR in a chain sometimes has to redo a small piece the previous one deferred. That work shows up as new lines, but it's churn against the plan, not new feature code.

So a ~60% LOC bump on a ~110% PR bump looks about right. The split tax is real, and it's the price of the review-throughput gain.

The May 7 spike (21k net) is a single feature the agent decided shouldn't split, and it didn't try to force it. That's the behavior I want: the cap as a default, not a rule the agent argues against its own judgment to satisfy. The code volume going up is a side effect. The review-load drop is the win.
