Hungry Hungry Self-Hosted Action Runners

I set up a scalable pool of self-hosted GitHub Actions runners on my home machine to keep up with the pace of agent-driven PRs on blazetrailsdev/trails. The pool scales to whatever the workload needs -- one dokku ps:scale command to resize it. An agent that was excited about finally being able to use the whole machine dialed it up to sixteen. With that much parallelism, CI stops being the bottleneck -- jobs pick up instantly, agents iterate faster, more PRs land, more CI fires. A nice loop.

It is also a nice loop for downloading the same pnpm cache, several thousand times. The runners are ephemeral by design: each is a Docker container that mints a registration token, picks up exactly one job, exits, and gets restarted by Dokku as a fresh filesystem. Single dispatch per container, no work directory carried over, no node_modules or build output shared between jobs. That property is what makes them safe to run on a home machine -- every job starts from a known image -- but it also means every job pays the install cost from zero.
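The lifecycle above looks roughly like this as a container entrypoint -- a hedged sketch, not the repo's actual script; it assumes the official actions/runner tarball is unpacked at /runner and a repo-scoped token arrives as GITHUB_PAT (both names are my placeholders):

```shell
#!/bin/sh
# Sketch of a one-job-per-container runner entrypoint (assumed paths/env names).
set -eu

REPO="blazetrailsdev/trails"

# Mint a short-lived registration token scoped to this container.
token=$(curl -fsS -X POST \
  -H "Authorization: Bearer ${GITHUB_PAT}" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/${REPO}/actions/runners/registration-token" \
  | jq -r .token)

cd /runner
# --ephemeral: accept exactly one job, then deregister and exit.
./config.sh --unattended --ephemeral \
  --url "https://github.com/${REPO}" \
  --token "$token"
exec ./run.sh
# When run.sh exits, Dokku restarts the container from the image:
# fresh filesystem, no carried-over work directory, no warm pnpm store.
```

The `--ephemeral` flag is what enforces the single-dispatch property; everything else is plumbing around it.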

The pnpm store for this repo is around 430 MB on disk. The CI workflow runs eight or so parallel jobs per push, each doing its own pnpm install --frozen-lockfile, and over a day and a half the self-hosted runners chewed through 2,777 jobs. With nothing shared between containers, that is on the order of a terabyte of pnpm traffic in 36 hours, all of it pulling over my home connection.
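Sanity-checking that figure -- the 430 MB and 2,777 are the numbers above, and the pessimistic assumption is that every job pulls the full store:

```shell
# Rough traffic estimate: full store per job, nothing shared between containers.
store_mb=430
jobs=2777
total_mb=$((store_mb * jobs))
echo "${total_mb} MB"                       # 1194110 MB, about 1.2 TB
echo "$((total_mb / 36 / 1000)) GB/hour"    # roughly 33 GB/hour sustained
```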

A persistent pnpm-store mount was documented in the runner README from the first commit -- this was supposed to be the part that collapsed all those installs into a single download. It was not working. The bytes went over the wire 2,777 times when they should have gone over once.
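For reference, the wiring that should have collapsed the installs is only a few lines -- this is a hedged sketch with an assumed app name and the default Dokku storage path, not the README's exact incantation:

```shell
# Host side: a directory that survives container restarts, mounted into the app.
dokku storage:ensure-directory pnpm-store
dokku storage:mount ci-runner /var/lib/dokku/data/storage/pnpm-store:/pnpm-store

# Job side: point pnpm at the mount so every install hits the shared store.
pnpm install --frozen-lockfile --store-dir /pnpm-store
```

With that in place, the first job pays the download and every later job hard-links out of /pnpm-store.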

My ISP noticed.

The immediate fix was the boring one: I turned the runners off and parked the experiment until the billing cycle resets. There is no clever shaping that gets you under a cap you have already blown through.

What I am putting in place before I bring them back is the part I should have started with -- visibility. I reached for vnstat, which is one of those quiet little Unix tools that has been doing exactly one thing exactly correctly since 2002. It sits on the interface, keeps a rolling database of bytes in and out, and answers any window you ask it for in milliseconds. No agent, no daemon you have to babysit, no dashboard -- just vnstat -h or vnstat -m and you have your answer. After dealing with the modern observability stack, opening a man page and finding a tool that already does the thing is its own small celebration. I piped it into my dokku-mail plugin with a small windowing script behind a single bandwidth-alert entrypoint, fired from cron at four cadences:

0  *  *    * *  root /usr/local/bin/bandwidth-alert hourly
5  0  *    * *  root /usr/local/bin/bandwidth-alert daily
10 0  *    * 1  root /usr/local/bin/bandwidth-alert weekly
15 0  1,15 * *  root /usr/local/bin/bandwidth-alert biweekly
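The windowing script behind that entrypoint is simple enough to sketch -- this is a hedged approximation, not the production script; the interface, budgets, mail hook, and the vnstat 2.x JSON paths are all assumptions:

```shell
#!/bin/sh
# bandwidth-alert -- sketch of the cron entrypoint (assumed env names and paths).
# Assumes vnstat 2.x (--json <mode> <limit>, byte counters) and jq.

# Map an alert window to vnstat arguments: mode flag, JSON key, bucket count.
window_args() {
  case "$1" in
    hourly)   echo "h hour 1"  ;;
    daily)    echo "d day 1"   ;;
    weekly)   echo "d day 7"   ;;
    biweekly) echo "d day 14"  ;;
    *)        return 1         ;;
  esac
}

# Sum rx+tx bytes over every returned bucket (vnstat JSON on stdin).
sum_rx_tx() {
  jq -r --arg key "$1" '[.interfaces[0].traffic[$key][] | .rx + .tx] | add'
}

main() {
  window="${1:?usage: bandwidth-alert hourly|daily|weekly|biweekly}"
  iface="${IFACE:-eth0}"          # assumption: home uplink interface
  budget_gb="${BUDGET_GB:-40}"    # assumption: per-window allowance
  set -- $(window_args "$window")
  used_gb=$(( $(vnstat -i "$iface" --json "$1" "$3" | sum_rx_tx "$2") / 1000000000 ))
  if [ "$used_gb" -ge "$budget_gb" ]; then
    printf '%s: %s GB used of a %s GB budget\n' "$window" "$used_gb" "$budget_gb" |
      mail -s "bandwidth-alert: $window over budget" root   # assumption: mail hook
  fi
}

if [ "$#" -gt 0 ]; then main "$@"; fi
```

All the state lives in vnstat's own database; the script just reads a window and compares it to a number.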

Hourly catches a runaway loop in real time. Daily catches a slow leak. Weekly and biweekly catch the kind of usage trend that is invisible day-to-day but adds up to an ISP letter by the end of the month -- the shape I missed the first time. Next cycle, the pool comes back smaller, scales up gradually, and trips an alert long before it trips a bandwidth cap.

I had gotten excited. GitHub has been rough enough lately that people are openly walking away from it -- a cofounder of a company I used to work at just moved Ghostty off the platform. Leaving is not on my menu. I have a home server with a CPU from 2014 and a Dokku install, and pointing some of the CI load at it was, I thought, the pragmatic move: take the part of the pain that hardware can fix, and fix it.

The speedup was real. I just did not picture what sixteen workers running flat-out actually consume. An agent saw a knob labeled scale, turned it as far as it would go, and the bill came due two layers down the stack at my ISP.
