
Fault Tolerance Is Not Just for Servers

architecture, reliability, agents

Fault tolerance sounds like a term from a high-load engineering talk. For small systems it often feels excessive: "why bother, we're not Facebook."

But stripped down to its core, fault tolerance isn't about scale at all. It comes down to one idea: no matter what happens, the system shouldn't fall over. That logic applies well beyond servers and databases.

Escalation up

You launch an agent and it crashes. Annoying? Of course. But instead of the classic "well, try again," the system should be able to recover, continue from where it left off, and, if it truly can't, degrade gracefully to a simpler scenario.

Sometimes it works in the opposite direction. The agent crashes, the system retries a few times while preserving progress, and if that still fails, it escalates: a more capable agent picks up the previous one's work and calmly finishes the task. From the user's perspective, nothing changes: they were waiting for a result, and they got it.
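Here is a minimal sketch of that retry-then-escalate loop. Everything in it is an illustrative assumption rather than a real agent framework: the names `Progress`, `AgentError`, `make_agent`, and `run_with_escalation` are hypothetical, and the "agents" are toy functions that work through a list of steps. The part that matters is the shared checkpoint: retries and the stronger agent both resume from it instead of starting over.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class AgentError(Exception):
    """Raised when an agent cannot finish its work (hypothetical error type)."""


@dataclass
class Progress:
    """Checkpoint that survives crashes, retries, and agent swaps."""
    done: List[str] = field(default_factory=list)


# An agent takes the full step list plus the shared checkpoint.
Agent = Callable[[List[str], Progress], None]


def make_agent(name: str, fails_on: str = "") -> Agent:
    """Build a toy agent that crashes whenever it reaches `fails_on`."""
    def agent(steps: List[str], progress: Progress) -> None:
        # Resume from the checkpoint instead of starting over.
        for step in steps[len(progress.done):]:
            if step == fails_on:
                raise AgentError(f"{name} crashed on step {step!r}")
            progress.done.append(step)
    return agent


def run_with_escalation(agents: List[Agent], steps: List[str],
                        max_retries: int = 3) -> Progress:
    """Try agents from weakest to strongest, keeping progress throughout."""
    progress = Progress()
    for agent in agents:
        for _ in range(max_retries):
            try:
                agent(steps, progress)
                return progress
            except AgentError:
                continue  # retry the same agent from the checkpoint
        # Retries exhausted: fall through to the next, more capable agent.
    raise AgentError("every agent failed; nothing left to escalate to")


small = make_agent("small", fails_on="c")  # always crashes on step "c"
big = make_agent("big")                    # assumed to be more capable
result = run_with_escalation([small, big], ["a", "b", "c", "d"])
print(result.done)  # prints ['a', 'b', 'c', 'd']
```

The small agent completes "a" and "b", fails three times on "c", and the big agent finishes "c" and "d" without redoing the earlier steps. The caller only ever sees the finished result.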

Degradation down

There's another side of fault tolerance — downward degradation.

For example, a commit message plugin automatically analyzes the changes to generate a message. But anything can go wrong at that stage: an error, invalid output, rate limits, too many changes at once. In that case, the system simply falls back to a simpler heuristic. The commit message comes out slightly worse, but it comes out. The user was waiting for a result, and they got one. Most of the time, they don't even notice that something went wrong.
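A rough sketch of that fallback, under stated assumptions: `commit_message`, `heuristic_message`, the `analyze` callback, and the `max_files` limit are all hypothetical names invented for this example, not the API of any real plugin. The smart path is wrapped so that any failure (an exception, empty output, too many files) quietly lands on the heuristic, and the reason goes to the log instead of the user.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("commit_plugin")  # hypothetical logger name


def heuristic_message(changed_files: List[str]) -> str:
    """Cheap fallback: describe the change by file names only."""
    if not changed_files:
        return "Update project files"
    if len(changed_files) == 1:
        return f"Update {changed_files[0]}"
    return f"Update {changed_files[0]} and {len(changed_files) - 1} more file(s)"


def commit_message(changed_files: List[str],
                   analyze: Callable[[List[str]], str],
                   max_files: int = 50) -> str:
    """Return the analyzed message, or the heuristic one if anything fails."""
    try:
        if len(changed_files) > max_files:
            raise ValueError("too many changes for analysis")
        message = analyze(changed_files)
        if not isinstance(message, str) or not message.strip():
            raise ValueError("analysis returned invalid output")
        # Keep only the subject line of whatever the analysis produced.
        return message.strip().splitlines()[0]
    except Exception:
        # The failure is recorded for later review, never shown to the user.
        logger.warning("analysis failed, using heuristic", exc_info=True)
        return heuristic_message(changed_files)


def rate_limited(files: List[str]) -> str:
    """Stand-in for an analysis backend that is currently unavailable."""
    raise RuntimeError("rate limit exceeded")


print(commit_message(["parser.py", "lexer.py"], rate_limited))
# prints: Update parser.py and 1 more file(s)
```

Catching every exception is a deliberate choice for this sketch: the caller is promised a string no matter what, and the details are left in the log.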

The kitchen stays in the kitchen

Naturally, all of these cases get logged. But the logs belong to the internal kitchen, not the user experience: why it failed, where it broke, what can be improved. Post-mortems happen after the fact, not during.

Because what happens in your kitchen is not the user's problem. They don't care that things were crashing, escalating, and restarting behind the scenes. They came for a result — and they should get it regardless.

That's the whole point of fault tolerance: the user should never pay for your internal problems. Even if it was total chaos inside.