Hungry Hungry Self-Hosted Action Runners
I set up a scalable pool of self-hosted GitHub Actions runners on my
home machine to keep up with the pace of agent-driven PRs on
blazetrailsdev/trails. The
pool scales to whatever the workload needs -- one dokku ps:scale
command to dial it up. An agent that was excited about finally being
able to use the whole machine dialed it up to sixteen. With that much
parallelism, CI stops being the bottleneck -- jobs start instantly,
agents iterate faster, more PRs land, more CI fires. A nice loop.
It is also a nice loop for downloading the same pnpm cache, several
thousand times. The runners are
ephemeral by design:
each is a Docker container that mints a registration token, picks up
exactly one job, exits, and gets restarted by Dokku as a fresh
filesystem. Single dispatch per container, no work directory carried
over, no node_modules or build output shared between jobs. That
property is what makes them safe to run on a home machine -- every job
starts from a known image -- but it also means every job pays the
install cost from zero.
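The whole lifecycle fits in a short entrypoint. A sketch of the shape -- not the actual script -- assuming the official actions/runner tarball is unpacked at /runner and a repo-scoped PAT is available as GITHUB_PAT:

#!/usr/bin/env bash
set -euo pipefail
REPO="blazetrailsdev/trails"

# Mint a short-lived registration token for this one container.
TOKEN=$(curl -sf -X POST \
  -H "Authorization: Bearer ${GITHUB_PAT}" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/${REPO}/actions/runners/registration-token" \
  | jq -r .token)

cd /runner
# --ephemeral: register for exactly one job, then deregister.
./config.sh --url "https://github.com/${REPO}" \
  --token "${TOKEN}" --ephemeral --unattended
./run.sh   # exits after the job; Dokku restarts a fresh container

The --ephemeral flag is what enforces single dispatch: the runner deregisters after one job, so a stale container can never pick up a second.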
The pnpm store for this repo is around 430 MB on disk. The CI workflow
runs eight or so parallel jobs per push, each doing its own pnpm install --frozen-lockfile, and over a day and a half the self-hosted
runners chewed through 2,777 jobs. With nothing shared between
containers, that is on the order of a terabyte of pnpm traffic in 36
hours, all of it pulling over my home connection.
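The back-of-envelope, treating every job as a full store download -- roughly what a cold container with no store does:

# 2,777 jobs x ~430 MB each, MB to TB
echo '2777 * 430 / 1000 / 1000' | bc -l   # ~1.19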
A persistent pnpm-store mount was documented in the runner README
from the first commit -- this was supposed to be the part that
collapsed all those installs into a single download. It was not
working. The bytes went over the wire 2,777 times when they should
have gone over once.
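For the record, the shape it should have had is two pieces: one host directory mounted into every runner container, and pnpm pointed at it. Roughly -- the app name and paths here are illustrative, not the repo's actual config:

# one store on the host, shared by every runner container
dokku storage:ensure-directory runner-pnpm-store
dokku storage:mount runner \
  /var/lib/dokku/data/storage/runner-pnpm-store:/pnpm-store

# inside the job environment, point pnpm at the mount
pnpm config set store-dir /pnpm-store

With that in place, pnpm hard-links packages out of /pnpm-store and only fetches versions the store has never seen.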
My ISP noticed.
The immediate fix was the boring one: I turned the runners off and parked the experiment until the billing cycle resets. There is no clever shaping that gets you under a cap you have already blown through.
What I am putting in place before I bring them back is the part I
should have started with -- visibility. I reached for
vnstat, which is one of those quiet little
Unix tools that has been doing exactly one thing exactly correctly
since 2002. It sits on the interface, keeps a rolling database of
bytes in and out, and answers any window you ask it for in
milliseconds. No agent, no daemon you have to babysit, no dashboard --
just vnstat -h or vnstat -m and you have your answer. After
dealing with the modern observability stack, opening a man page and
finding a tool that already does the thing is its own small
celebration. I wired it into my dokku-mail plugin: a small windowing script behind a single bandwidth-alert entrypoint, fired from cron at four cadences:
0 * * * * root /usr/local/bin/bandwidth-alert hourly
5 0 * * * root /usr/local/bin/bandwidth-alert daily
10 0 * * 1 root /usr/local/bin/bandwidth-alert weekly
15 0 1,15 * * root /usr/local/bin/bandwidth-alert biweekly
Hourly catches a runaway loop in real time. Daily catches a slow leak. Weekly and biweekly catch the kind of usage trend that is invisible day-to-day but adds up to an ISP letter by the end of the month -- the shape I missed the first time. Next cycle, the pool comes back smaller, scales up gradually, and trips an alert long before it trips a bandwidth cap.
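The entrypoint itself stays small. A sketch of the windowing logic, assuming vnstat 2.x (whose --json output reports bytes) plus jq; the interface name, the budgets, and the mail hook are placeholders for whatever your setup and dokku-mail invocation actually look like:

#!/usr/bin/env bash
# bandwidth-alert: sum recent traffic for a window, mail if over budget.
set -euo pipefail

WINDOW="${1:?usage: bandwidth-alert hourly|daily|weekly|biweekly}"
IFACE="eth0"   # assumption: the WAN-facing interface

# Per-window budgets in GiB -- illustrative numbers, tune to your cap.
case "$WINDOW" in
  hourly)   MODE=h; KEY=hour; BACK=1;  LIMIT_GIB=5   ;;
  daily)    MODE=d; KEY=day;  BACK=1;  LIMIT_GIB=40  ;;
  weekly)   MODE=d; KEY=day;  BACK=7;  LIMIT_GIB=150 ;;
  biweekly) MODE=d; KEY=day;  BACK=14; LIMIT_GIB=300 ;;
  *) echo "unknown window: $WINDOW" >&2; exit 1 ;;
esac

# Sum rx+tx bytes over the last $BACK records of the requested period.
USED=$(vnstat -i "$IFACE" --json "$MODE" \
  | jq "[.interfaces[0].traffic.${KEY}[-${BACK}:][] | .rx + .tx] | add")

if (( USED > LIMIT_GIB * 1024 ** 3 )); then
  # placeholder: swap in your dokku-mail invocation here
  printf 'used %d GiB in the last %s window (budget %d GiB)\n' \
    "$(( USED / 1024 ** 3 ))" "$WINDOW" "$LIMIT_GIB" \
    | mail -s "bandwidth-alert: $WINDOW over budget" you@example.com
fi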
I had gotten excited. GitHub has been rough enough lately that people are openly walking away from it -- a cofounder of a company I used to work at just moved Ghostty off the platform. Leaving is not on my menu. I have a home server with a CPU from 2014 and a Dokku install, and pointing some of the CI load at it was, I thought, the pragmatic move: take the part of the pain that hardware can fix, and fix it.
The speedup was real. I just did not picture what sixteen workers
running flat-out actually consume. An agent saw a knob labeled
scale, turned it as far as it would go, and the bill came due two
layers down the stack at my ISP.