As a curmudgeon, I am often full of reckons. A reckon is often confused with an opinion, but it’s more like a heuristic: it isn’t always true in theory, but it’s often true in practice. Today’s reckons are about message brokers.
They are used by inexperienced programmers to glue systems together. Although they are easy to set up and get going with, when they fail they take the entire application with them: a broker often trades convenience for a central point of failure.
Before you start throwing away your working code base, I want to ask you a few questions: What do you do when the broker fails? What happens if a message isn’t handled? What about consumers or producers crashing? Knowing the answers to these is the first step to understanding whether your broker is helping or hurting your system.
To show how a broker can hurt, let’s look at a classic use: running tasks in parallel over a cluster of machines. In particular, I encountered problems while working on a distributed web crawler built around a message broker.
The crawler starts from a seed list of URLs, pushing them into the broker; meanwhile, a number of workers pull out pages to crawl and push any links they find back through the broker, into the crawler. Both the workers and the crawler were run by hand. Simple enough, but we encountered many failure cases.
If a single worker died, it would take its task with it. We could resend after a timeout, but if the task froze the worker, eventually all of the workers would halt. If enough workers died, the queue would fill up unnoticed and collapse, or worse, the crawler would sit idle for days, not realising anything was broken. When the crawler broke, a similar process happened: the workers would churn away for hours, racking up CPU time to no avail. Occasionally we could restart the workers or crawler and everything would be fine. Sometimes we’d have to reset the queues, throw away the work, and restart. These failures cost us a lot of money and an awful lot of time, things a small company didn’t have in abundance.
The more experienced engineers amongst you will know what we had done wrong. Brokers are a great way of isolating components in your system, and unfortunately we’d isolated the crawler and workers from finding out if something had failed.
There are many ways to fix this: we can set a hard limit on the queue, cap the rate of messages, or set a time-to-live on messages. Introducing back-pressure into the system would allow errors to propagate more quickly, saving us some time. Another complementary technique is to add acknowledgement messages, allowing the crawler to know when a job was accepted, and heartbeat messages, to know that the job was still ongoing.
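A bounded queue is the simplest form of back-pressure. Here is a minimal sketch; the queue size, timeout, and the `submit` function are illustrative assumptions, not code from our crawler:

```python
import queue

# A bounded queue: when it fills up, producers block and then fail fast,
# surfacing the error instead of letting the backlog grow silently.
# The size and timeout here are illustrative, not tuned values.
tasks = queue.Queue(maxsize=100)

def submit(url, timeout=5.0):
    """Enqueue a URL, failing fast if the workers have fallen behind."""
    try:
        tasks.put(url, timeout=timeout)
    except queue.Full:
        # Back-pressure: report the failure to the producer immediately,
        # rather than letting the queue fill up unnoticed.
        raise RuntimeError("workers are not keeping up; refusing new work")
```

The point isn’t the queue itself, but that the producer finds out something is wrong instead of the broker quietly absorbing the backlog.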
Instead, we got rid of the message broker.
The crawler communicated with the workers directly, and managed the worker pool. When launched by the crawler, workers would poll the crawler for work, sending all of the results back. If a worker failed, the crawler would notice, and kill the remote process. If the crawler failed, the workers would terminate. At all times, the crawler knew which machine was doing which task. At last, something in the system was responsible for handling failure, beyond the already stretched operations team.
At this point, the canny reader will shout “Ha! But there is queueing going on here”, and sure enough, there was. Instead of a queue hiding away in the middle of the network, we’d pushed it into the crawler. There was still messaging too, but we were just using TCP.
We’d ignored and subsequently embraced three good design principles for reliable systems: Fail fast, Process Supervision, and the End-to-End principle.
Fail fast means that processes give up as soon as they encounter failure, rather than ambling on and hoping for the best. Process supervision means that errant components are watched and restarted. These two principles alone account for much of the reliability of Erlang/OTP. The End-to-End principle, on the other hand, is why TCP works.
TCP gets reliable delivery over an unreliable network by having the hosts at either end handle missing or broken messages, rather than burdening the routers with it. Handling things at the edges covers for problems in the middle. We’d done something similar by pushing the queue inside the crawler: the component responsible for handing out work was now also responsible for handling its failure.
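The same idea in miniature: reliability lives at the sender, not in the channel. The lossy channel, drop rate, and function names below are toy assumptions for illustration:

```python
import random

random.seed(1)  # deterministic "losses" for the demo

def unreliable_send(message, inbox):
    """A lossy channel: drops messages some of the time (assumed 30% drop rate)."""
    if random.random() > 0.3:
        inbox.append(message)
        return True   # receiver acknowledged delivery
    return False      # no acknowledgement: the message was lost in transit

def reliable_send(message, inbox, max_retries=10):
    """End-to-end reliability: the sender retries until the receiver acknowledges."""
    for _ in range(max_retries):
        if unreliable_send(message, inbox):
            return True
    return False
```

Nothing in the middle needs to be trustworthy; the edges retransmit until an acknowledgement comes back, which is roughly the bargain TCP strikes.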
Not every queue was like ours, although I’ve encountered a few. A similar pattern emerges when brokers are persistent: the temptation is to avoid storing messages at the ends, so when the broker fails, all of the new messages are lost. When you restart the broker, the system still doesn’t work until the backlog of messages is cleared.
In the end, it isn’t so much about message brokers as about treating them as infallible gods. By using acknowledgements, back-pressure, and other techniques, you can move responsibility out of the broker and into the edges. What was once a central point of failure becomes an effectively stateless server, no longer hiding failures from your components. Your system works because it knows that it can fail.
Remember the golden rule of reckons: It depends. Engineering in practice is not a series of easy choices, but a series of tradeoffs. Brokers aren’t bad, but the tradeoffs you’re making might not be in your favour.