Reverts Over Rollbacks

Here's the thing about time reversal in physics: mathematically, the equations work in both directions. Newton's laws, Maxwell's equations, even quantum mechanics — they're all time-symmetric. Run the film backwards and the math still checks out.

But we know from experience that time doesn't run backwards. Entropy increases. Eggs don't unscramble. The math says it's possible. Thermodynamics says it won't happen.

I think about this every time someone suggests rolling back a deployment.


I was on call last week when customers started seeing a blank white page. Nothing rendered. Open the browser console and you'd find a single JavaScript error. That was it. Frontend only — but the frontend is what customers see, so a blank page might as well be a total outage.

The cause was almost comically small: a pnpm audit fix. A minor dependency bump that silently broke the frontend build. It slipped through without triggering CI, got approved without ever being built, and merged. From there, the merge queue carried it forward — every subsequent build failed, got deployed to staging, and then to production. One innocuous PR poisoned the pipeline.
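The structural gap here is that nothing between approval and merge actually built the change. One way to close it — a hedged sketch, assuming GitHub Actions and a pnpm project; the workflow name, action versions, and "build" script are illustrative, not our actual config — is to make the same build the deploy uses a required check on every PR and every merge-queue batch, dependency bumps included:

```yaml
# Hypothetical required check: if the app doesn't build, the PR can't merge.
name: build-gate
on: [pull_request, merge_group]   # runs on PRs and again in the merge queue
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - run: pnpm install --frozen-lockfile   # reject silent lockfile drift
      - run: pnpm run build                   # the step that would have caught the blank page
```

With something like this marked as a required status check, an audit-fix PR fails before approval rather than after deploy.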

I try to test every PR locally, or at least in a preview environment, before approving. That didn't happen here. A dependency audit feels safe — routine maintenance, not a code change. But "feels safe" is exactly when things slip through. The PRs that break production are rarely the ones you're worried about.

We found the cause fast. The question became: roll back or roll forward?


The instinct to roll back is strong. The old containers are still warm. The load balancer can just point back. It's one command. When production is on fire, fast feels like the only thing that matters.

But a rollback is trying to unscramble the egg.

The moment a breaking change goes live, entropy kicks in. The world moves forward. Migrations run. Caches fill with new shapes. Users take actions. External APIs receive calls. Each of these is a small, irreversible change — heat dissipating into the environment. The old code was tested against the world as it existed before all of that. Deploying it now is deploying it into a world it doesn't know about.

In our case, the damage was limited — a frontend break doesn't corrupt databases or trigger side effects. A rollback probably would have worked fine.

But "probably would have worked fine" is exactly the kind of thinking that got us here in the first place. A dependency bump probably wouldn't break anything. A rollback probably won't hit stale infrastructure.


And a rollback path is stale infrastructure. Think of it this way: your normal deployment flow is a well-worn path. It runs every day, gets exercised dozens of times a week, and people actively maintain it. A rollback is the side door you use in emergencies — which means it's the door you use least.

In a large org where infra is always evolving — new CI steps, new deploy targets, new security policies — unused paths go stale fast. The rollback tooling that worked six months ago may not work today, and you won't find out until you're already mid-incident. Like a fire exit that hasn't been inspected, it might open. It might not. Not the kind of uncertainty you want during an outage.

A revert goes through the front door. The same door as every other deploy. The door you know works, because you walked through it this morning.


So that's what we did — rolled forward. We identified the breaking dependency, wrote a targeted fix, pushed it, and let CI run. Our manual deploy process has options for skipping canary deploys — we could have jumped straight to production. But we let it run through CI and watched staging come back to life first. Only then did we jump the merge queue and push to production. We skipped some steps, but not the ones that matter. The fix shows up in the git log with a message. It can be reviewed. It's a decision, not an emergency lever.
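The mechanics of a revert are worth seeing concretely. Here's a minimal sketch in a throwaway repo — the file contents and commit messages are hypothetical stand-ins, not our actual incident. The point it demonstrates: the bad change stays in history, and the fix is an ordinary new commit that can ride the normal pipeline.

```shell
#!/bin/sh
# Revert vs. rollback, demonstrated in a scratch repo.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

echo "working page" > index.html
git add index.html && git commit -qm "good build"

echo "blank page" > index.html                # stand-in for the bad dependency bump
git commit -aqm "audit fix that breaks the build"

# A rollback would point deploys back at the first SHA and stop there.
# A revert records the undo as a new, forward-moving commit instead:
git revert --no-edit HEAD

cat index.html        # back to "working page"
git log --oneline     # three commits: the mistake and its undo, both on record
```

The revert commit then goes through review, CI, and the merge queue like any other change — which is the whole point.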

In our case, a rollback would have been the same process anyway — same deploy pipeline, just with a different SHA. The revert wasn't even slower. It just pointed forward instead of backward.

The site came back. The dashboards settled. Then we figured out why.

That's the thing about moving forward: it still works even when something breaks. You don't need to reverse entropy. You just need to take the next step carefully.


At HashiCorp, every incident retro starts the same way: a reminder to be kind, non-blaming, and open-minded. Every time. It sounds like boilerplate — like re-reading the rules of the game every time you play. But it resets the room. It reminds everyone that the person who shipped the breaking change is sitting right there. We're all humans who make mistakes. We're here to solve problems together, not to assign fault.

That framing matters more than any technical process. If people are afraid to deploy, they batch changes into bigger, riskier releases. If people are afraid to admit mistakes, you don't find root causes. A blameless retro isn't soft — it's how you actually get better.

Reverts fit the same ethos. A revert says "this change didn't work, let's undo it and figure out why." A rollback says "something went wrong, hit the emergency brake." One is a calm step forward. The other is a panic response. And panic is where blame lives — in the rush to erase what happened rather than understand it.

Entropy isn't a failure. It's just the direction things move. The question is whether you work with it or pretend you can reverse it.

Time moves in one direction. Deployments should too.
