dokku-sso: Testing Against Real Apps
dokku-sso is my third Dokku plugin built with Claude Code. Each plugin taught me the same lesson harder: your tests are only as good as your test framework. A framework that limits you to testing internals will give you green checkmarks and broken software.
What It Does
I run my homelab on Dokku. Every app I self-host needs authentication - some want LDAP, some want OIDC, some just need a login page in front of them. dokku-sso manages all of it through Dokku's CLI.
dokku sso:directory:create main # spin up an LDAP server
dokku sso:directory:link main gitea # inject LDAP config into Gitea
dokku sso:frontend:create main # spin up Authelia or Authentik
dokku sso:frontend:protect radarr # put SSO in front of an app
dokku sso:oidc:enable grafana # register an OIDC client
Three LDAP providers (LLDAP, GLAuth, OpenLDAP), two SSO frontends (Authelia, Authentik). That's a lot of combinations. Which brings us to the real challenge.
Three Plugins, Three Approaches to Testing
dokku-dns: The Tests That Lied
dokku-dns was my first plugin. 5.5 months, 116 commits. The test suite used BATS with mocked API providers - stub out the AWS Route53 calls, verify the stubs get called, call it a day. The tests verified internal behavior: does this function parse DNS records correctly? Does this mock get called with the right arguments?
By Phase 25, Claude declared it "production ready" and "enterprise grade." The tests were green. Then I ran the actual commands a user would run, and nothing worked. Not Cloudflare. Not DigitalOcean. Not even AWS. The sync:deletions command would have deleted every DNS record in a hosted zone - MX records, CNAMEs, everything.
Every unit test passed because every unit test asked "does the code do what the code says it does?" The answer was yes. The question nobody asked was "does a DNS record actually appear?"
dokku-mail: Faster, Still Blind
dokku-mail was my second plugin. 12 days, 65 commits. Huge improvement in speed. The tests ran against a real Dokku instance this time - real containers, real plugin commands.
But email delivery was still mocked. MailHog caught outbound messages so I could assert they were sent. That's better than dokku-dns, but it still couldn't answer the question that actually matters: does the recipient's mail client accept this message? Does DKIM validation pass? Does the SPF record resolve correctly from the outside?
I shipped it faster. I had more confidence. But the gap between "tests pass" and "it works" was still there - just smaller.
dokku-sso: Test the Real Thing
dokku-sso was 6 days, 111 commits. The plugin is bash. The tests are TypeScript and Playwright. Not being limited to bash for testing turned out to be the biggest win.
dokku-dns and dokku-mail were tested with BATS - bash's testing framework. BATS is fine for checking that a function returns the right string, but try orchestrating "spin up Gitea, configure LDAP, create a user, open a browser, type credentials, verify the dashboard loads" in bash. It's miserable. Playwright and TypeScript made complex test setups manageable - browser automation, async waits, readable assertions, structured test data. The testing framework stopped being a bottleneck, and suddenly the kind of tests I actually needed became practical to write.
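The async-wait pattern is a good example of what BATS makes painful and TypeScript makes trivial. A sketch of the kind of polling helper such a suite leans on (`waitFor` and its options are my own names here, not the plugin's actual test code):

```typescript
// Poll an async condition until it returns true or the timeout expires.
async function waitFor(
  condition: () => Promise<boolean>,
  { timeoutMs = 60_000, intervalMs = 1_000 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`condition not met within ${timeoutMs}ms`);
}

// Demo: a stubbed "is the container healthy?" check that succeeds
// on the third poll, standing in for a real HTTP health probe.
async function demo(): Promise<number> {
  let checks = 0;
  await waitFor(async () => ++checks >= 3, { timeoutMs: 5_000, intervalMs: 10 });
  return checks;
}

demo().then((n) => console.log(`app came up after ${n} polls`));
```

In bash this is a `while`/`sleep` loop with manual deadline arithmetic repeated in every test; here it is one reusable function with a real timeout error.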
It also helped that auth infrastructure is entirely local. dokku-dns needed real AWS, Cloudflare, or DigitalOcean credentials to test against real providers - expensive, slow, and a pain to set up in CI. dokku-mail needed real SMTP infrastructure. But LDAP servers, Authelia, Gitea, Grafana - those are all just containers. The entire test environment runs on a single machine with no external dependencies.
That made real integration testing practical in a way it never was for the previous plugins. You can't meaningfully mock an LDAP server and learn anything - the question is always "does Gitea actually accept this user?" or "can Grafana complete an OIDC flow through Authelia?"
The bugs are never in the LDAP protocol. They're in the gaps between systems:
- Gitea expects `uid` as the login attribute, Nextcloud expects `cn`
- Authelia's OIDC discovery endpoint returns a different issuer URL than what Grafana expects behind a reverse proxy
- OpenLDAP's schema requires attributes in a different order than LLDAP
- GitLab's LDAP integration silently fails if the bind DN format doesn't match its expectations
No mock catches those. Only the real app does.
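That per-app drift is mundane enough to pin down as data. A sketch of how a test suite can make the expectation explicit per app (the `loginAttribute` table and `bindDn` helper are illustrative, not the plugin's actual code):

```typescript
// Each app binds LDAP users by a different attribute. Getting this wrong
// produces a silent login failure, not an error - so the test suite pins
// the expected attribute per app and derives the bind DN from it.
type LdapApp = "gitea" | "nextcloud";

const loginAttribute: Record<LdapApp, string> = {
  gitea: "uid", // Gitea logs users in by uid
  nextcloud: "cn", // Nextcloud expects cn instead
};

// Build the bind DN a given app would use for a given user.
function bindDn(app: LdapApp, user: string, baseDn: string): string {
  return `${loginAttribute[app]}=${user},ou=users,${baseDn}`;
}

console.log(bindDn("gitea", "alice", "dc=example,dc=com"));
// uid=alice,ou=users,dc=example,dc=com
console.log(bindDn("nextcloud", "alice", "dc=example,dc=com"));
// cn=alice,ou=users,dc=example,dc=com
```

The point is not the helper itself but where the assertion lands: the test then checks that the app accepts a login with that DN, not that the string was built.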
Testing Against Real Apps
The dokku-sso CI spins up actual instances of six open source applications and verifies that authentication works end-to-end:
| App | Auth Method | What's Tested |
|---|---|---|
| Gitea | LDAP | User login via LDAP bind |
| Nextcloud | LDAP | User login via LDAP bind |
| Grafana | LDAP + OIDC | Both direct LDAP and OIDC via Authelia and Authentik |
| GitLab | LDAP | User login via LDAP bind |
| Radarr | Forward auth | Authelia protecting app via reverse proxy |
| oauth2-proxy | OIDC | Generic OIDC flow via Authelia |
Each test installs Dokku from scratch, installs the auth plugin, deploys the application, configures authentication, and then uses Playwright to log in through the actual UI. A Gitea test creates a directory service, links it to a Gitea instance, creates a user, opens a browser, types the username and password into Gitea's login form, and verifies the dashboard loads.
If the LDAP attributes are wrong, the test fails. If the OIDC redirect is misconfigured, the test fails. No mocks, no shortcuts.
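The shape of such a test, sketched with Playwright's page API reduced to a minimal interface so the example stays self-contained (the real suite drives an actual browser; `loginViaForm`, the URL, and the selectors here are illustrative):

```typescript
// A minimal slice of a Playwright-style page API - just enough to
// express "type credentials into a real form, wait for the dashboard".
interface Page {
  goto(url: string): Promise<void>;
  fill(selector: string, value: string): Promise<void>;
  click(selector: string): Promise<void>;
  waitForSelector(selector: string): Promise<void>;
}

// The assertion is the outcome (the dashboard renders), never internal state.
async function loginViaForm(
  page: Page,
  baseUrl: string,
  user: string,
  pass: string,
): Promise<void> {
  await page.goto(`${baseUrl}/user/login`);
  await page.fill("#user_name", user);
  await page.fill("#password", pass);
  await page.click('button[type="submit"]');
  await page.waitForSelector(".dashboard");
}

// A recording stub stands in for a browser so this sketch runs anywhere.
const actions: string[] = [];
const stub: Page = {
  goto: async (u) => void actions.push(`goto ${u}`),
  fill: async (s, v) => void actions.push(`fill ${s}=${v}`),
  click: async (s) => void actions.push(`click ${s}`),
  waitForSelector: async (s) => void actions.push(`wait ${s}`),
};

loginViaForm(stub, "http://gitea.local", "alice", "secret").then(() =>
  console.log(actions.join("\n")),
);
```

Swap the stub for a real Playwright `page` and the same five lines of test logic exercise the full stack: Dokku, the plugin, the LDAP server, and Gitea's actual login form.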
Parallel Everything
Each of these tests takes minutes. GitLab alone needs a 20-minute timeout because the container takes that long to initialize. Running them sequentially would mean a CI pipeline measured in hours.
Instead, every test runs as an independent GitHub Actions job:
lint ─┐
unit-tests │
integration-tests │
e2e-gitea │
e2e-nextcloud │
e2e-multi-app │
e2e-oidc-app │
e2e-grafana-ldap ├── all parallel
e2e-grafana-oidc │
e2e-gitlab-ldap │
e2e-glauth │
e2e-openldap │
e2e-authentik │
e2e-authentik-grafana-ldap │
e2e-authentik-grafana-oidc │
e2e-radarr-forward-auth ─┘
16 jobs, zero dependencies between them. Each job gets its own runner, installs its own Dokku, deploys its own app. Wall-clock time is determined by the slowest job (usually GitLab at ~15 minutes), not the sum of all jobs.
Each test is fully self-contained. No shared database, no shared Docker daemon, no shared Dokku instance. Each job is a clean room.
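The wall-clock math is the same as fan-out concurrency anywhere: total time is the maximum over the jobs, not the sum. A toy demonstration with invented millisecond durations standing in for the real 5-20 minute jobs:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Simulated job durations - stand-ins for independent CI jobs.
const jobs: Record<string, number> = {
  "e2e-gitea": 50,
  "e2e-nextcloud": 100,
  "e2e-gitlab-ldap": 150, // the slowest job sets the wall-clock time
};

async function runAll(): Promise<number> {
  const start = Date.now();
  // No dependencies between jobs, so they all start at once.
  await Promise.all(Object.values(jobs).map(sleep));
  return Date.now() - start;
}

runAll().then((elapsed) => {
  const sum = Object.values(jobs).reduce((a, b) => a + b, 0);
  console.log(`parallel: ~${elapsed}ms; sequential would be ${sum}ms`);
});
```

With zero `needs:` edges between jobs, GitHub Actions behaves like that `Promise.all`: the pipeline finishes when GitLab does, regardless of how many other jobs ran.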
The Progression
| | dokku-dns | dokku-mail | dokku-sso |
|---|---|---|---|
| Timeline | 5.5 months | 12 days | 6 days |
| Commits | 116 | 65 | 111 |
| Test framework | BATS | BATS | Playwright + TypeScript |
| CI jobs | 3 | 2 | 16 |
| Tests real apps | No (mocked) | Partial (MailHog) | Yes - 6 apps |
The timeline shrank 27x. Test coverage went from fiction to fact. But the most important row is the last one - and the shift from BATS to Playwright is what made it possible.
BATS kept me stuck testing internals because that's all it was practical for. Moving to Playwright and TypeScript didn't just change the language - it changed what I could test. Browser automation, multi-service orchestration, real UI verification. The tests I actually needed were finally tests I could write.
The lesson across three plugins: when working with LLMs, unit tests have diminishing value. The LLM is good at making internals look correct. What it can't fake is the end result. Test the outcome.
The Tradeoff
Each job pulls multi-gigabyte Docker images and takes 5-20 minutes. But they only run when relevant - a change to the Authelia config doesn't re-test GLAuth, a Gitea fix doesn't re-run the Grafana OIDC suite. Most PRs trigger a handful of jobs, not all 16.
Auth bugs don't throw errors - they silently lock you out. Or worse, they expose things you thought were private. CI minutes are cheap currency for that kind of reassurance.