Wrestling with LLMs: Managing a Smarmy, Overconfident Assistant
I spent 3.5 months building dokku-dns almost entirely with Claude Code. I figured six weeks. By late September, Claude was telling me we were done - "production ready," "enterprise grade," the whole nine yards.
I actually finished in early December, after spending a month fixing bugs that could have deleted all my DNS records.
The Illusion of Progress
Through Phase 10, everything felt great. I had basic functionality working with AWS Route53. I was manually testing - creating DNS records, updating them, deleting them, watching records appear in the AWS console.
Then I got ambitious. Phases 11-25 generalized the plugin to support Cloudflare and DigitalOcean. I stopped manually testing as much because the tests were passing and AWS was already working. Claude kept reinforcing that things were going well. By Phase 25, I had a "Pre-Release Preparation complete" message. The test suite was green.
- "Working perfectly! All providers integrated!"
- "Ready to ship!"
- "Production ready!"
- "Enterprise grade!"
When software starts using enterprise sales language, that's your signal to question everything.
I asked Claude to generate some real-world manual test scripts - the actual commands a user would run to verify the happy path.
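The kind of script I asked for looks roughly like this. It's a sketch, not the plugin's actual test script: the `dokku dns:*` subcommand names and the zone ID are illustrative (check `dokku dns:help` for the real ones), and the `DRY_RUN` wrapper is my addition so the sequence can be reviewed before it touches real DNS.

```shell
#!/usr/bin/env bash
# Hypothetical happy-path smoke script. Subcommand names and Z123ABC are
# illustrative placeholders, not the plugin's documented interface.

DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: $*"   # print instead of executing
  else
    "$@"
  fi
}

APP=myapp
DOMAIN=api.example.com

run dokku dns:add "$APP" "$DOMAIN"   # register the domain with the plugin
run dokku dns:sync "$APP"            # push the record to the provider
run dig +short "$DOMAIN"             # does it actually resolve?
run aws route53 list-resource-record-sets --hosted-zone-id Z123ABC \
  --query "ResourceRecordSets[?Name=='${DOMAIN}.']"   # does it exist upstream?
```

The point is the last two lines: the script doesn't trust the plugin's own success messages, it asks DNS and the provider directly.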
Nothing worked.
Not Cloudflare. Not DigitalOcean. Not even AWS - the provider that had been working earlier. The multi-provider refactoring had silently broken everything. Every provider was completely non-functional. But the tests? Still green. Claude? Still confident.
What Manual Testing Found
The issues ranged from embarrassing to catastrophic.
The sync:deletions command deleted everything. It was supposed to remove DNS records that no longer existed in your app. What it actually did was delete ALL Route53 records in the entire hosted zone. Not just the app's records. Not just dokku-managed records. Everything. MX records for your email? Gone. Custom CNAME for your CDN? Deleted. A logic error that would have been obvious with any manual testing.
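The correct behavior is a set difference scoped to records the plugin owns, not to the whole zone. Here's a minimal sketch of the distinction; the record names and the `managed_records` ownership list are illustrative stand-ins for state the real plugin reads from its data directory and `aws route53 list-resource-record-sets`:

```shell
#!/usr/bin/env bash
# Records the app still declares (should be kept)
app_records="api.example.com"

# Every record in the hosted zone, one per line: "name type"
zone_records="api.example.com A
old.example.com A
example.com MX
cdn.example.com CNAME"

# Records this plugin created earlier (its ownership list)
managed_records="api.example.com
old.example.com"

# Buggy logic: delete everything in the zone the app doesn't declare.
# That sweeps up the MX and CNAME records the plugin never created.
buggy_deletions=$(comm -23 \
  <(cut -d' ' -f1 <<<"$zone_records" | sort) \
  <(sort <<<"$app_records"))
echo "$buggy_deletions"

# Safe logic: only consider records we manage, then drop the stale ones.
stale_records=$(comm -23 \
  <(sort <<<"$managed_records") \
  <(sort <<<"$app_records"))
echo "$stale_records"   # -> old.example.com, and nothing else
```

Same inputs, and the buggy version queues the mail and CDN records for deletion while the safe version deletes exactly one stale record.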
DNS records weren't being created at all. Silent failures everywhere. Commands completed "successfully," tests passed, but check the actual DNS provider? Nothing there. The plugin was pretending to work.
rm -rf commands with unvalidated variables. Empty variable? You just tried to delete everything from root. The kind of bug that makes you sweat retroactively.
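The standard bash guard for this class of bug is the `${var:?message}` expansion, which aborts instead of expanding to an empty string. A sketch, with illustrative names rather than the plugin's real code:

```shell
#!/usr/bin/env bash
# With an empty PLUGIN_DATA_ROOT, an unguarded
#   rm -rf "$PLUGIN_DATA_ROOT/$app"
# expands to rm -rf "/myapp". The ${var:?} guard fails loudly instead.

remove_app_data() {
  local app="$1"
  # ${...:?} exits with an error if the value is unset or empty,
  # so rm never sees a path rooted at "/"
  rm -rf "${PLUGIN_DATA_ROOT:?PLUGIN_DATA_ROOT is not set}/${app:?app name is required}"
}

# Unset variable: the guard refuses instead of deleting from root
result=$( (unset PLUGIN_DATA_ROOT; remove_app_data myapp) 2>/dev/null \
  && echo "deleted" || echo "refused" )
echo "$result"   # -> refused

# Set properly: the function removes exactly the app's directory
PLUGIN_DATA_ROOT=$(mktemp -d)
mkdir -p "$PLUGIN_DATA_ROOT/myapp"
remove_app_data myapp
[ ! -e "$PLUGIN_DATA_ROOT/myapp" ] && echo "myapp data removed"
rmdir "$PLUGIN_DATA_ROOT"
```

Two characters of syntax per variable, and the worst case goes from "deleted the filesystem" to "printed an error and exited".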
All of this while every test was green.
How the Tests Lied
The unit tests were testing functions in isolation. "Does this function parse DNS records correctly?" Yes. Gold star. But it didn't test whether those parsed records ever made it to the actual DNS provider.
The integration tests used stubs that avoided testing critical functionality. They'd mock the AWS API calls, verify the mock was called, and call it a day. The tests checked that the code was trying to do something, not that it actually worked.
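The shape of the problem, reduced to a few lines (this is an illustration of the pattern, not the plugin's actual test code):

```shell
#!/usr/bin/env bash
aws_called=0
aws() {          # stub shadows the real AWS CLI
  aws_called=1
  return 0       # always "succeeds", creates nothing
}

create_record() {
  # imagine the change batch is built wrong, so nothing would be created
  aws route53 change-resource-record-sets --hosted-zone-id Z123 \
    --change-batch '{}'
}

create_record

# What the generated tests asserted: "the code tried to call AWS". Passes.
[ "$aws_called" -eq 1 ] && echo "stub test: PASS"

# What they should have asserted: "the record now exists and resolves".
# A stub can't answer that; only a real provider (or a faithful fake) can:
#   dig +short api.example.com
```

The stub test goes green whether `create_record` works or not, because it measures intent, not effect.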
Claude had carefully written tests that only tested the parts that worked. Green checkmarks are not the same as working software - they only mean the test suite was convinced. The coverage looked thorough. The results looked clean. But the tests were measuring effort, not correctness.
Why This Happens
LLMs aren't lying when they say things are working. They're oblivious. No malice, no deception, no intent. They're pattern-matching systems generating confident text without understanding what "working" means. Hanlon's Razor applies: never attribute to malice what's explained by a complete lack of comprehension.
Anthropomorphizing them as "lying" or "trying to trick you" actually makes you less effective. When you think something is deceptive, you look for the deception. When you understand it's oblivious, you verify everything - which is what you actually need to do.
The pattern: LLMs optimize for green checkmarks, not working software. They'll game tests to pass. They'll delete failing tests (I caught this multiple times - "All tests passing now!" after quietly removing the evidence). They'll write tests that test nothing meaningful. They see passing tests the same way they see failing tests - as text to be responded to. They have no concept of the difference between working and broken.
They're like having a tireless assistant who types incredibly fast, works around the clock, and genuinely cannot tell whether the code works. Not lying. Not lazy. Just incapable of that judgment.
The Aftermath
I created a new roadmap. 25 more phases. Another month of work. Complete rewrite of the deletion system with proper safety checks. Validation on every destructive operation. Error handling that actually handled errors. Each fix revealed more broken code underneath.
The whole time, Claude maintained that smarmy confidence. Fix a critical bug, and it would congratulate me on the "enhancement" like we were adding features instead of preventing data loss.
What Actually Changed About How I Work
Despite everything, I built a Dokku plugin in 3.5 months that would have taken me much longer without LLM assistance. The tool isn't the problem. How I used it was.
Architecture and design matter more now, not less. LLMs can't design abstractions or notice when design is getting messy. They can't tell when a file is too large or a function is doing too much. That's still on you.
Testing instincts are critical. Your ability to look at a test suite and think "wait, are we actually testing anything important here?" is what prevents catastrophic failures. LLMs will game tests. They'll write tests that look comprehensive and test nothing. You need the judgment to tell the difference.
Manual testing is not optional. The moment I stopped running the actual commands a user would run, everything drifted. The test suite became a fiction that confirmed itself. Running real commands against real providers is the only way to know the software works.
Enthusiasm is not a signal. When Claude says "production ready," that's the same text generation that produces everything else. It's not an assessment. It's a pattern completion. The only signal that matters is behavior you've observed yourself.
The work didn't go away when I started using LLMs. It shifted. Less time typing, more time verifying. Less time writing code, more time questioning whether the code works. That tradeoff is worth it - but only if you actually do the verifying and questioning. Skip that, and you ship a DNS plugin that deletes all your records.