Making Copilot Talk Back to Agents

3/30/2026 · 8 min read · 1,688 words

I run coding agents in tmux sessions on my server. Not my laptop. My server. This means I can close my laptop, walk away, and the agent keeps working. When I come back, there's a PR waiting. That part works great.

The project I'm using this on is a TypeScript port of Rails. The spec already exists (Rails itself), so there are hundreds of skipped tests waiting to be unskipped as each module gets implemented. The agent's job is two-part: make sure both the implementation and the tests match Rails as closely as possible. That clarity is part of why this loop works as well as it does.

The part that didn't work was everything after the push.

The Human Relay

The agent pushes a PR. Copilot reviews it. Tests run. Comments appear. And then nothing happens, because the agent doesn't know any of that occurred. It's sitting in a tmux pane, cursor blinking, waiting for me to tell it what GitHub said.

So I'd open GitHub, read the review comments, copy the URLs, switch to my terminal, paste them in, and type something like "copilot left these comments, please address them." Then I'd wait for the fix, watch the push, go back to GitHub, and check if Copilot reviewed again. It usually didn't. I'd click around the UI to retrigger it. I can't count how many times I typed "copilot reviewed" into a terminal.

The gh CLI should have helped, but it had its own problems. Comments are paged at 30, and the agent didn't understand that new comments were at the end. It would fetch the first page, see old comments, and either try to address already-resolved feedback or conclude there was nothing new. Sometimes I resorted to pasting individual comment URLs directly because the agent couldn't find them through the API.

I was the bottleneck. Not because I was slow at reviewing, but because I was slow at relaying.

hook-to-agent

I built a small Go app called hook-to-agent and deployed it to my Dokku server. It receives GitHub webhooks and does one thing: gets the right information to the right agent.

When a webhook arrives (a review, a comment, a test failure, a merge), the app figures out which tmux pane should receive the notification. Because webhooks fire per-event, the agent only sees new comments. No pagination, no sifting through already-resolved feedback to find what's fresh.

That's the interesting part. My agents always use git worktrees, so every active branch has its own directory. hook-to-agent scans all tmux panes for their working directories, checks which directory is a worktree for the branch in the webhook, and sends a short notification to that pane with tmux send-keys. When a PR is first opened, the agent gets a nudge to self-review before Copilot even sees it. Merge events notify the agent too, so it knows when its work lands.

The Copilot Problem

GitHub Copilot has a "review on every push" setting. I have it checked. It almost never works. The rest of the time, you have to manually trigger a review through the UI.

I looked for an API. There isn't a real one, or if there is, they don't want you using it. I spent more time than I'd like to admit trying to make it work programmatically before giving up.

So I did the obvious thing: I drove a browser.

I have another service called remote-firefox-driver that I originally built for a different project: scraping financial sites where headless browsers get banned and 2FA makes credential embedding impractical. It communicates with a Firefox extension that drives my real browser, so I get my actual session and password manager without embedding credentials in an app. It's off by default and only works on my home network, because it's a security nightmare otherwise.

But it turns out "click this button on a webpage while logged in" is a recurring need when APIs don't exist. After hook-to-agent sees a push event, it tells remote-firefox-driver to open the PR page and click the buttons to trigger a Copilot review.

It's a Rube Goldberg machine. It works.

The Loop

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#E6D3B8', 'primaryTextColor': '#3D332A', 'primaryBorderColor': '#A2680B', 'lineColor': '#A2680B', 'secondaryColor': '#F1E0C8', 'tertiaryColor': '#E8D5BF', 'edgeLabelBackground': '#F1E0C8', 'nodeTextColor': '#3D332A', 'nodeBorder': '#A2680B', 'mainBkg': '#E6D3B8', 'clusterBkg': '#F1E0C8'}}}%%
graph TD
    A[Agent pushes PR] --> B[GitHub webhook]
    B --> C[hook-to-agent]
    C --> D{Event type?}
    D -->|Push| E[remote-firefox-driver]
    E --> F[Trigger Copilot review]
    F --> B
    D -->|PR opened| H
    D -->|Review / Comment| H
    D -->|CI failure| H
    D -->|Merge| H
    H[Find tmux pane by worktree]
    H --> I[tmux send-keys]
    I --> J[Agent reads & fixes]
    J --> A

The agent pushes a PR. When the PR opens, the agent gets a self-review prompt. The push also triggers remote-firefox-driver to kick off a Copilot review. Copilot posts comments. The webhook fires. hook-to-agent sends them to the right tmux pane. The agent reads the comments and pushes a fix. The cycle repeats. Test failures and merges follow the same path; CI results and merge events arrive as webhooks and land in the right pane.

By the time I sit down to review, the PR has already been through a few rounds. The agent addressed Copilot's feedback, tests are passing, and I'm looking at something closer to mergeable. I'm still the one who reviews and merges, but the work that reaches me is in better shape than it used to be.

Two AIs Arguing

What emerged from the loop wasn't just faster feedback. It was a division of labor I didn't plan. Copilot is good at thinking about the whole system. It catches half-finished features, notices when a change has implications elsewhere, and flags work that's been stubbed out but never completed. The coding agent is good at writing and improving code, but it has a tendency to leave stubs and move on.

I leaned into this. I added specific instructions to my Copilot review configuration: leave unimplemented tests as skip instead of creating empty stubs, and ask for unfinished features to be added to the roadmap rather than left as dead code. This turned Copilot into something like a scope linter. It doesn't just review style, it reviews completeness.

A typical PR goes through 15 to 30 messages between them. The most dramatic one started as "unskip 90 tests," then became "unskip 99 more tests," then "unskip 75 more tests," and finally landed at "unskip 23 more tests." The agent kept writing tests that looked right but didn't actually test anything: assertions that would pass even if the implementation were broken, tests named after behaviors they never exercised, date predicates that just checked isToday(new Date()) (which passes even if isToday always returns true). Copilot flagged these across 7 review rounds and 30 comments. Each round the agent either fixed the test or re-skipped it, and the scope shrank until what remained was actually correct. Eleven commits to go from ambition to honesty.

Another tried to take an API compatibility score from 58% to 100% by creating interfaces for every module in the system. The agent churned out 41 files of interface definitions, and almost none of them matched the actual implementation. Methods declared as properties, instance methods as class methods, type signatures that didn't match what the code actually accepted. Copilot caught all of it across 28 comments.

Some PRs have hit the 70-90 comment range, usually variations of that second pattern: the agent creates a lot of stubs and Copilot systematically asks for each one to be either implemented or removed. Those long threads used to require me in the middle, deciding which comments to relay and which to ignore. When I was the relay, I'd skip Copilot's low-confidence suggestions to save time. Now the agent reads everything, including the marginal finds, and occasionally those turn out to be worth addressing. The automation is less selective than I was, which paradoxically makes it better. My role shifted from mediator to final approver.

Sometimes the agent quits with Copilot messages still outstanding. When that happens, I'll read through the remaining comments. If I agree with the agent that they're not worth addressing, I merge. Otherwise I ask for fixes. But that's a judgment call on a handful of comments, not the full relay job it used to be.

What Went Wrong

The first version didn't filter by event type. Merge events triggered notifications, and they'd land in the first tmux pane that matched the repository, not the branch, because the branch was gone. That agent was understandably confused about the stream of messages it was receiving about merges it didn't initiate.

Worktree matching is still fragile. The whole routing system depends on agents working in a worktree whose directory matches the branch name, but the CLI always starts from the main repo root. If I'm resuming work on an existing branch, the agent's starting directory doesn't match the worktree. Sometimes agents forget to create a worktree at all. There's no good fix for this yet. It's a convention that the agent doesn't always follow, and when it doesn't, webhooks land in the wrong pane or nowhere.

I also hit rate limits quickly. The feedback loop was genuinely faster, which meant more pushes, more reviews, more API calls. The bottleneck shifted from me to the rate limit. There's no API for checking usage limits either, so I ended up adding another remote-firefox-driver job that scrapes my usage dashboard and pauses notifications when I hit 80% of my limit, so agents wind down gracefully instead of slamming into a wall mid-conversation.

What This Changed

Before, I was a relay. I moved information between systems that couldn't talk to each other, copy-pasting and retriggering. Now I'm a judge. Two AIs argue about the code (one writes it, one reviews it) and I decide when they're done.

That's a better use of my attention. The agent is fast but sloppy. It optimizes for the metric in front of it (test count, API coverage percentage, files changed) without always doing the work those metrics represent. Copilot is slow but thorough. It reads the whole PR, checks whether the code matches its claims, and flags the gaps. Neither is sufficient alone. But the loop between them, when it's working, produces code that's closer to what I'd write myself than what either would produce independently.

The irony is that the hardest part to automate was the thing that should have been easiest. Copilot has the infrastructure to review automatically. It just doesn't, reliably. So I'm driving a browser to click buttons on a webpage to trigger an AI review that feeds back into a terminal AI. It's absurd. But it works, and that's what matters when you're trying to close a loop.