programming is terrible
lessons learned from a life wasted

Scaling in the presence of errors—don’t ignore them

Building a reliable, robust service often means building something that can keep working when some parts fail. A website where not every feature is available is often better than a website that’s entirely offline. Doing this in a meaningful way is not obvious.

The usual response is to hire more DBAs, more SREs, and even more folk in Support. Error handling, or making software that can recover from faults, often feels like the option of last resort—if ever considered in the first place.

The usual response to error handling is optimism. Unfortunately, the alternatives aren’t exactly clear, and they’re often difficult to choose between, too. If you have two services, what do you do when one of them is offline: Try again later? Give up entirely? Or just ignore it and hope the problem goes away?

Surprisingly, all of these can be reasonable approaches. Even ignoring problems can work out for some systems. Sort-of. You don’t get to ignore errors, but sometimes recovering from an error can look very similar to ignoring it.


Imagine an orchard filled with wireless sensors for heat, light, and moisture. It makes little sense to try and resend old temperature readings upon error. It isn’t the sensor’s job to ensure the system works, and there’s very little a sensor can do about it anyway. As a result, it’s pretty reasonable for a sensor to send messages with wild abandon—or in other words, fire-and-forget.

The idea behind fire-and-forget is that you don’t need to save old messages when the next message overrides them, or when a missing message will not cause problems. It’s a situation where each message is treated as being the first message sent—forgetting that any attempt was made prior.

Done well, fire-and-forget is like a daily meeting—if someone misses the meeting, they can turn up the next day. Done badly, fire-and-forget is akin to replacing email with shouting across the office, hoping that someone else will take notes.

It isn’t that there’s no error handling in a fire-and-forget client, it’s that the best method of recovery is to just keep going. Unfortunately, people often misinterpret fire-and-forget to mean “avoid any error handling and hope for the best”.
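
As a rough sketch, a fire-and-forget sensor client looks something like this. The collector address, port, and reading format are all made up for illustration; the point is that a failed send gets noted and dropped, because the next reading supersedes it anyway.

```python
import json
import socket
import time


def read_sensor():
    # Stand-in for reading the real hardware.
    return {"temperature_c": 21.5, "light_lux": 300, "moisture": 0.4}


def run(host="collector.example", port=9999, interval=60):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        packet = json.dumps({"ts": time.time(), **read_sensor()}).encode()
        try:
            # Fire-and-forget: send the latest reading and move on.
            sock.sendto(packet, (host, port))
        except OSError:
            # The best recovery here is to keep going; the next reading
            # will override this one, so there's nothing worth resending.
            pass
        time.sleep(interval)
```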

You don’t get to ignore errors.

When you ignore errors, you only put off discovering them—it’s not until another problem is caused that anyone even realises something has gone wrong. When you ignore errors, you waste time that could be spent recovering from them.

This is why, despite the occasional counterexample, the best thing to do when encountering an error is to give up. Stop before you make anything worse and let something else handle it.


Giving up is a surprisingly reasonable approach to error handling, assuming something else will try to recover, restart, or resume the program. That’s why almost every network service gets run in a loop—restarting immediately upon crashing, hoping the fault was transient. It often is.
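
The loop itself is almost embarrassingly small. Here’s a minimal sketch of one, assuming the service is an external command; real process supervisors (systemd, runit, and friends) add backoff, logging, and limits on top of the same idea.

```python
import subprocess
import time


def supervise(command, delay=1.0):
    """Run a command, and restart it whenever it exits with an error."""
    while True:
        result = subprocess.run(command)
        if result.returncode == 0:
            break  # a clean exit, nothing to recover from
        print(f"exited with {result.returncode}, restarting in {delay}s")
        time.sleep(delay)


# e.g. supervise(["python3", "server.py"])
```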

There’s little point in trying to repeatedly connect to a database when the user is already mashing refresh in the browser. A unix pipeline could handle every possible bad outcome, but more often than not, running the program again makes everything work.

Although giving up is a good way to handle errors, restarting from scratch isn’t always the best way to recover from them.

Some pipelines work on large volumes of data, or do arduous amounts of numerical processing, and no-one is ever happy about repeating days or weeks of work. In theory, you could add error handling code, reduce the risk that the program will crash, and avoid an expensive restart, but in practice it’s often easier to restructure code to carry on where it left off.

In other words, give up, but save your progress to make restarting less time consuming.

For a pipeline, this usually entails an awful lot of temporary files—to save the output of each subcommand, and the result of splitting the input up into smaller batches. You can even retry things automatically, but for a lot of pipelines, manual recovery is still relatively inexpensive.
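
A sketch of that pattern, with placeholder commands: each stage writes its output to a file, and on a re-run any stage whose output already exists gets skipped, so manual recovery is usually just deleting the broken file and running it again.

```python
import os
import subprocess

# Each stage is (output file, command); the commands here are placeholders.
stages = [
    ("fetch.out",  ["fetch-input"]),
    ("parse.out",  ["parse", "fetch.out"]),
    ("report.out", ["summarise", "parse.out"]),
]

for output, command in stages:
    if os.path.exists(output):
        continue  # this stage finished on a previous run, skip it
    with open(output + ".tmp", "w") as out:
        subprocess.run(command, stdout=out, check=True)
    os.rename(output + ".tmp", output)  # only ever keep complete outputs
```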

For other long running processes, this usually means something like checkpoints, or sagas. Or in other words, transforming a long running process into a short running one that’s run constantly, writing out the progress it makes to some file or database somewhere.
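
For a single long running job, the checkpoint can be as small as a counter in a file. A minimal sketch, assuming the work is a list of items processed in order, with handle standing in for the real work:

```python
import json
import os


def handle(item):
    ...  # placeholder for the actual work


def process(items, checkpoint_path="progress.json"):
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]  # resume where the last run left off

    for i, item in enumerate(items):
        if i < done:
            continue  # finished before the last crash
        handle(item)
        with open(checkpoint_path, "w") as f:
            json.dump({"done": i + 1}, f)  # save progress, so giving up is cheap
```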

Over time, every long running process will get broken up into smaller parts, as restarting from scratch becomes prohibitively expensive. A long running process is just that much more likely to run into an impossible error—full disks, no free memory, cosmic rays—and be forced to give up.

Sometimes the only way to handle an error is to give up.

As a result, the best way to handle errors is to structure your program to make recovery easier. Recovery is what makes all the difference between “fire-and-forget” and “ignoring-every-error”, even though both share the same optimism.

You can do things that look like ignoring errors, or even letting something else handle it, as long as there’s a plan to recover from them. Even if it’s restarting from scratch, even if it’s waking someone up at night, as long as there’s some plan, then you aren’t ignoring the problem. Assuming the plan works, that is.

You don’t get to ignore errors. They’re inevitably someone’s problem. If someone tells you they can ignore errors, they’re telling you someone else is on-call for their software.

That, or they’re using a message broker.


A message broker, if you’re not sure, is a networked service that offers a variety of queues that other computers on the network can interact with. Usually some clients enqueue messages, and others poll for the next unread message, but they can be used in a variety of other configurations too.
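
The shape of it, using Python’s in-process queue as a stand-in for a networked broker; a real broker client has the same enqueue-and-poll shape, just with a connection in the middle:

```python
import queue
import threading

broker = queue.Queue()  # an in-process stand-in, not a real networked broker


def producer():
    for i in range(10):
        broker.put({"job": i})  # fire it into the queue and move on


def consumer():
    while True:
        message = broker.get()  # poll for the next unread message
        print("processing", message)
        broker.task_done()


threading.Thread(target=consumer, daemon=True).start()
producer()
broker.join()  # wait until every enqueued message has been handled
```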

Like with a unix pipe, message brokers are used to get software off the ground. Similarly to using temporary files, the broker allows different parts of the pipeline to consume and produce inputs at different rates, but it doesn’t easily allow replaying or restarting when errors occur.

Like a unix pipe, message brokers are used in a very optimistic fashion. Firing messages into the queue and moving on to the next task at hand.

Somewhat like a unix pipeline, but with some notable differences. A unix pipeline blocks when full, pausing the producer until the consumer can catch up. A unix pipeline will exit if any of the subcommands exit, and return an error if the last subcommand failed.

A message broker does not block the producer until the consumer can catch up. In theory, this means transient errors or network issues between components don’t bring the entire system down. In practice, the more queues you have in a pipeline, the longer it takes to find out if there’s a problem.
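
Staying with the in-process stand-in, the difference is a single argument, which is part of why it’s so easy to end up with the unbounded kind:

```python
import queue

bounded = queue.Queue(maxsize=10)  # like a unix pipe: put() blocks when full,
                                   # so a slow consumer slows the producer down

unbounded = queue.Queue()          # like most brokers by default: put() never
                                   # blocks, the backlog just keeps on growing
```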

Sometimes that works out. When there’s no growth, brokers act like a buffer between parts of a system, handling variance in load. They work well at slowing bursty clients down, and can provide a central point for auditing or access control.

When there is growth, queues explode regularly until some form of rate limiting appears. When more load arrives, queues are partitioned, and then repartitioned. Scaling a broker inevitably results in moving to something where the queue is bounded, or even ephemeral.

The problem with optimism is that when things do go wrong, not only do you have no idea how to fix it, you don’t even know what went wrong. To some extent, a message broker hides errors—programs can come and go as they please, and there’s no way to tell if the other part is still reading your messages—but it can only hide errors for so long.

In other words, fire-and-regret.

Although an unbounded queue is a tempting abstraction, it rarely lives up to the mythos of freeing you from having to handle errors. Unlike a unix pipeline, a message broker will always fill up your disks before giving up, and changing things to make recovery easy isn’t as straightforward as adding more temporary files.

Brokers can only recover from one error—a temporary network outage—so other mechanisms get brought in to compensate. Timeouts, retries, and sometimes even a second “priority” queue, because head-of-line blocking is genuinely terrible to deal with. Even then, if a worker crashes, messages can still get dropped.
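
Those retries usually end up looking something like this sketch: exponential backoff with a little jitter, and a hard limit after which you give up and let something else deal with it.

```python
import random
import time


def retry(operation, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: give up, let something else handle it
            # back off exponentially, with jitter so clients don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```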

Queues rarely help with recovery. They frequently impede it.

Imagine a build pipeline, or background job service where requests are dumped into some queue with wild abandon. When something breaks, or isn’t running like it is supposed to, you have no idea where to start recovery.

With a background queue, you can’t tell what jobs are being run right now. You can’t tell if something’s being retried, or has failed, but maybe you’ve got log files you can search through. With logs, you can see what the system was doing a few minutes ago, but you still have no idea what it might be doing right now.

Even if you know the size of a queue, you’ll have to check the dashboard a few minutes later—to see if the line wiggled—before you know for sure if things are probably working. Hopefully.

Making a build pipeline with queues is relatively easy, but building one that the user can cancel, or watch, involves a lot more work. As soon as you want to cancel a task, or inspect a task, you need to keep things somewhere other than a queue.

Knowing what a program is up to means tracking the in-between parts, and even for something as simple as running a background task, it can involve many states—Created, Enqueued, Processing, Complete, Failed—not just Enqueued, and a broker only handles the Enqueued part.

Not very well. As soon as one queue feeds into another, an item of work can be in several different queues at once. If an item is missing from the queue, you know it’s either been dropped or processed; if an item is in the queue, you don’t know if it’s being processed, but you do know it will be. A queue doesn’t just hide errors, it hides state too.
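
Keeping that state somewhere other than a queue doesn’t have to be elaborate. Here’s a sketch using a SQLite table, with made-up columns and states: the worker updates the row as it goes, and recovery starts by asking which jobs have been stuck in one state for too long.

```python
import sqlite3

db = sqlite3.connect("jobs.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        id INTEGER PRIMARY KEY,
        state TEXT NOT NULL DEFAULT 'created',
        updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")


def set_state(job_id, state):
    # state is one of: created, enqueued, processing, complete, failed
    db.execute(
        "UPDATE jobs SET state = ?, updated_at = CURRENT_TIMESTAMP WHERE id = ?",
        (state, job_id),
    )
    db.commit()


def stuck_jobs(older_than_minutes=30):
    # The question a queue can't answer: what is this system doing right now?
    return db.execute(
        "SELECT id, state FROM jobs"
        " WHERE state IN ('enqueued', 'processing')"
        " AND updated_at < datetime('now', ?)",
        (f"-{older_than_minutes} minutes",),
    ).fetchall()
```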

Recovery means knowing what state the program was in before things went wrong, and when you fire-and-forget into a queue, you give up on knowing what happens to it. Handling errors, recovering from errors, means building software that knows what state it is currently operating in. It also means structuring things to make recovery possible.

That, or you give up on automated recovery of almost any kind. In some ways, I’m not arguing against fire-and-forget, or against optimism—but against optimism that prevents recovery. Not against queues, but how queues inevitably get used.

Unfortunately, recovery is relatively easy to imagine but not necessarily straightforward to implement.

This is why some people opt to use a replicated log, instead of a message broker.


If you’ve never used a replicated log, imagine an append only database table without a primary key, or a text file with backups, and you’re close. Or imagine a message broker, but instead of enqueue and dequeue, you can append to the log or read from the log.

Like a queue, a replicated log can be used in a fire-and-forget fashion, with not-so-great consequences. Just like before, chaos will ensue as concepts like rate-limiting, head-of-line blocking, and the end-to-end principle are slowly contended with—if you use a replicated log like a queue, it will fail like a queue.

Unlike a queue, a replicated log can aid recovery.

Every consumer sees the same log entries, in the same order, so it’s possible to recover by replaying the log, or by catching up on old entries. In some ways it’s more like using temporary files instead of a pipeline to join things together, and the strategies for recovery overlap with temporary files, too—like partitioning the log so that restarts aren’t as expensive.
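
A toy version of the idea, using a single local file in place of a replicated log: producers append entries, and each consumer remembers its own offset, so catching up after a crash is just reading from wherever it stopped.

```python
import json


def append(path, entry):
    with open(path, "a") as log:
        log.write(json.dumps(entry) + "\n")  # the log only ever grows


def read_from(path, offset):
    # A consumer that crashed at offset N calls read_from(path, N) to catch up.
    with open(path) as log:
        for position, line in enumerate(log):
            if position >= offset:
                yield position, json.loads(line)
```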

Like temporary files, a replicated log can aid in recovery, but only to a certain point. A consumer will see the same messages, in the same order, but if an entry gets dropped before reaching the log, or if entries arrive in the wrong order, some, or potentially all, hell can break loose.

You can’t just fire-and-forget into a log, not over a network. Although a replicated log is ordered, it will preserve the ordering it gets, whatever that happens to be.

This isn’t always a problem. Some logs are used to capture analytic data, or fed into aggregators, so the impact of a few missing or out of order entries is relatively low—a few missing entries might as well be called high-volume random sampling and declared a non-issue.

For other logs, missing entries could cause untold misery. Recovering from missing entries might involve rebuilding the entire log from scratch. If you’re using a replicated log for replication, you probably care quite a lot about the order of log entries.

Like before, you can’t ignore errors—you only make things expensive to recover from.

Handling errors like out of order or missing log entries means being able to work out when they have occurred.

This is more difficult than you might imagine.


Take two services, a primary and a secondary, both with databases, and imagine using a replicated log to copy changes from one to another.

It doesn’t seem too difficult at first. Every time the primary service makes a change to the database, it writes to the log. The secondary reads from the log, and updates its database. If the primary service is a single process, it’s pretty easy to ensure that every message is sent in the right order. When there’s more than one writer, things can get rather involved.

Now, you could switch things around—write to the log first, then apply the changes to the database, or use the database’s log directly—and avoid the problem altogether, but these aren’t always an option. Sometimes you’re forced to handle the problem of ordering the entries yourself.

In other words, you’ll need to order the messages before writing them to the log.

You could let something else provide the order, but you’d be mistaken if you think a timestamp would help. Clocks move forwards and backwards and this can cause all sorts of headaches.

One of the most frustrating problems with timestamps is ‘doomstones’: a service with a wonky clock, far out in the future, deletes a key and issues a deletion event stamped with that future time. Every subsequent operation on that key gets silently dropped until the deletion event is cleared. The other problem with timestamps is that if you have two entries, one after the other, you can’t tell if there are any entries that came between them.
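
A toy last-write-wins store shows how little it takes. The timestamps below are seconds, picked so the deletion arrives from a clock a day in the future; everything after it loses, silently.

```python
store = {}  # key -> (timestamp, value); value None means "deleted"


def apply(key, timestamp, value):
    current = store.get(key)
    if current is None or timestamp >= current[0]:
        store[key] = (timestamp, value)
    # else: the incoming change loses, and is silently discarded


apply("user:1", 1_000_000, "alice")
apply("user:1", 1_086_400, None)        # delete, from a clock a day ahead
apply("user:1", 1_000_060, "alice b.")  # a genuine later update: silently dropped
print(store["user:1"])                  # (1086400, None): the doomstone wins
```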

Things like “Hybrid Logical Clocks”, or even atomic clocks, can help to narrow down clock drift, but only so much. You can shrink the window of uncertainty, but there’s still some clock skew. Again, clocks will go forwards and backwards—timestamps are terrible for ordering things precisely.

In practice you need explicit version numbers, 1, 2, 3… etc, or a unique identifier for each version of each entry, and a link back to the record being updated, to order messages.

With a version number, messages can be reordered, missing messages can be detected, and both can be recovered from, although managing and assigning those version numbers can be quite difficult in practice. Timestamps are still useful, if only for putting things in a human perspective, but without a version number, it’s impossible to know what precise order things happened in—and that no steps are missing, either.
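
The consumer’s side of that is fairly mechanical, which is the appeal. A sketch, assuming each entry carries a record id and a version that counts up from 1:

```python
def check(entries):
    """Group entries by record, sort by version, and report any gaps."""
    by_record = {}
    for record_id, version, value in entries:
        by_record.setdefault(record_id, []).append((version, value))

    for record_id, versions in sorted(by_record.items()):
        versions.sort()  # reorder whatever the network delivered
        seen = {version for version, _ in versions}
        missing = set(range(1, versions[-1][0] + 1)) - seen
        if missing:
            # Don't apply the later versions yet: hold, or re-fetch what's missing.
            print(record_id, "is missing versions", sorted(missing))
```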

You don’t get to ignore errors, but sometimes the error handling code isn’t that obvious.

Using version numbers, or even timestamps, falls under building a plan for recovery. Building something that can continue to operate in the presence of failure. Unfortunately, building something that works when other parts fail is one of the more challenging parts of software engineering.

It doesn’t help that doing the same thing in the same order is so difficult that people use terms like causality and determinism to make the point sink in.

You don’t get to ignore errors, but no one said it was going to be easy.


Although things like replicated logs, message brokers, or even unix pipes can allow you to build prototypes, clear demonstrations of how your software works—they do not free you from the burden of handling errors.

You can’t avoid error handling code, not at scale.

The secret to error handling at scale isn’t giving up, ignoring the problem, or even trying again—it is structuring a program for recovery, making errors stand out, allowing other parts of the program to make decisions.

Techniques like fail-fast, crash-only-software, process supervision, but also things like clever use of version numbers, and occasionally the odd bit of statelessness or idempotence. What these all have in common is that they’re all methods of recovery.

Recovery is the secret to handling errors. Especially at scale.

Giving up early so other things have a chance, continuing on so other things can catch up, restarting from a clean state to try again, saving progress so that things do not have to be repeated.

That, or put it off for a while. Buy a lot of disks, hire a few SREs, and add another graph to the dashboard.

The problem with scale is that you can’t approach it with optimism. As the system grows, it needs redundancy, or to be able to function in the presence of partial errors or intermittent faults. Humans can only fill in so many gaps.

Staff turnover is the worst form of technical debt.

Writing robust software means building systems that can exist in a state of partial failure (like incomplete output), and writing resilient software means building systems that are always in a state of recovery (like restarting)—neither comes from engineering the happy path of your software.

When you ignore errors, you transform them into mysteries to solve. Something or someone else will have to handle them, and then have to recover from them—usually by hand, and almost always at great expense.

The problem with avoiding error handling in code is that you’re only avoiding automating it.

In other words, the trick to scaling in the presence of errors is building software around the notion of recovery. Automated recovery.

That, or burnout. Lots of burnout. You don’t get to ignore errors.