programming is terrible: lessons learned from a life wasted

Repeat yourself, do more than one thing, and rewrite everything

If you ask a programmer for advice—a terrible idea—they might tell you something like the following: Don’t repeat yourself. Programs should do one thing and one thing well. Never rewrite your code from scratch, ever!

Following “Don’t Repeat Yourself” might lead you to a function with four boolean flags, and a matrix of behaviours to carefully navigate when changing the code. Splitting things up into simple units can lead to awkward composition and struggling to coordinate cross cutting changes. Avoiding rewrites means they’re often left so late that they have no chance of succeeding.
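
To make that concrete, here is a hypothetical sketch (the names are invented): the “shared” function below has grown a flag per caller, while the two slightly repetitive functions after it can each change without consulting the other.

    # Hypothetical: one "reusable" function that has accumulated a flag per caller.
    # Any change now has to consider every combination of flags.
    def format_name(user, upper=False, last_first=False, initials_only=False):
        if last_first:
            name = f"{user['last']}, {user['first']}"
        else:
            name = f"{user['first']} {user['last']}"
        if initials_only:
            name = "".join(word[0] for word in name.replace(",", "").split())
        return name.upper() if upper else name

    # Two slightly repetitive functions, each free to change on its own.
    def badge_name(user):
        return f"{user['first']} {user['last']}".upper()

    def directory_name(user):
        return f"{user['last']}, {user['first']}"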

The advice isn’t inherently bad—although there is good intent, following it to the letter can create more problems than it promises to solve.

Sometimes the best way to follow an adage is to do the exact opposite: embrace feature switches and constantly rewrite your code, pull things together to make coordination between them easier to manage, and repeat yourself to avoid implementing everything in one function.

This advice is much harder to follow, unfortunately.

Repeat yourself to find abstractions.

“Don’t Repeat Yourself” is almost a truism—if anything, the point of programming is to avoid work.

No-one enjoys writing boilerplate. The more straightforward it is to write, the duller it is to summon into a text editor. People are already tired of writing eight exact copies of the same code before even having to do so. You don’t need to convince programmers not to repeat themselves, but you do need to teach them how and when to avoid it.

“Don’t Repeat Yourself” often gets interpreted as “Don’t Copy Paste” or to avoid repeating code within the codebase, but the best form of avoiding repetition is in avoiding reimplementing what exists elsewhere—and thankfully most of us already do!

Almost every web application leans heavily on an operating system, a database, and a variety of other lumps of code to get the job done. A modern website reuses millions of lines of code without even trying. Unfortunately, programmers love to avoid repetition, and “Don’t Repeat Yourself” turns into “Always Use an Abstraction”.

By an abstraction, I mean two interlinked things: an idea we can think and reason about, and the way in which we model it inside our programming languages. Abstractions are a way of repeating yourself, so that you can change multiple parts of your program from one place. Abstractions allow you to manage cross-cutting changes across your system, or to share behaviours within it.

The problem with always using an abstraction is that you’re preemptively guessing which parts of the codebase need to change together. “Don’t Repeat Yourself” will lead to a rigid, tightly coupled mess of code. Repeating yourself is the best way to discover which abstractions, if any, you actually need.

As Sandi Metz put it, “duplication is far cheaper than the wrong abstraction”.

You can’t really write a re-usable abstraction up front. Most successful libraries or frameworks are extracted from a larger working system, rather than being created from scratch. If you haven’t built something useful with your library yet, it is unlikely anyone else will. Code reuse isn’t a good excuse to avoid duplicating code, and writing reusable code inside your project is often a form of preemptive optimization.

When it comes to repeating yourself inside your own project, the point isn’t to be able to reuse code, but rather to make coordinated changes. Use abstractions when you’re sure about coupling things together, rather than for opportunistic or accidental code reuse—it’s ok to repeat yourself to find out when.

Repeat yourself, but don’t repeat other people’s hard work. Repeat yourself: duplicate to find the right abstraction first, then deduplicate to implement it.

With “Don’t Repeat Yourself”, some insist that it isn’t about avoiding duplication of code, but about avoiding duplication of functionality or duplication of responsibility. This is more popularly known as the “Single Responsibility Principle”, and it’s just as easily mishandled.

Gather responsibilities to simplify interactions between them

When it comes to breaking a larger service into smaller pieces, one idea is that each piece should only do one thing within the system—do one thing, and do it well—and the hope is that by following this rule, changes and maintenance become easier.

It works out well in the small: reusing variables for different purposes is an ever-present source of bugs. It’s less successful elsewhere: although one class might do two things in a rather nasty way, disentangling it isn’t of much benefit when you end up with two nasty classes with a far more complex mess of wiring between them.

The only real difference between pushing something together and pulling something apart is that some changes become easier to perform than others.

The choice between a monolith and microservices is another example of this—the choice between developing and deploying a single service, or composing things out of smaller, independently developed services.

The big difference between them is that cross-cutting change is easier in one, and local changes are easier in the other. Which one works best for a team often depends more on environmental factors than on the specific changes being made.

Although a monolith can be painful when new features need to be added and microservices can be painful when co-ordination is required, a monolith can run smoothly with feature flags and short lived branches and microservices work well when deployment is easy and heavily automated.

Even a monolith can be decomposed internally into microservices, albeit in a single repository and deployed as a whole. Everything can be broken into smaller parts—the trick is knowing when it’s an advantage to do so.

Modularity is more than reducing things to their smallest parts.

Invoking the ‘single responsibility principle’, programmers have been known to brutally decompose software into a terrifyingly large number of small interlocking pieces—a craft rarely seen outside of obscenely expensive watches, or bash.

The traditional UNIX command line is a showcase of small components that do exactly one function, and it can be a challenge to discover which one you need and in which way to hold it to get the job done. Piping things into awk '{print $2}' is almost a rite of passage.

Another example of the single responsibility principle is git. Although you can use git checkout to do six different things to the repository, they all use similar operations internally. Despite having singular functionality, components can be used in very different ways.

A layer of small components with no shared features creates a need for a layer above where these features overlap, and if absent, the user will create one, with bash aliases, scripts, or even spreadsheets to copy-paste from.

Even adding this layer might not help you: git already has a notion of user-facing and automation-facing commands, and the UI is still a mess. It’s always easier to add a new flag to an existing command than it is to duplicate it and maintain it in parallel.

Similarly, functions gain boolean flags and classes gain new methods as the needs of the codebase change. In trying to avoid duplication and keep code together, we end up entangling things.

Although components can be created with a single responsibility, over time their responsibilities will change and interact in new and unexpected ways. What a module is currently responsible for within a system does not necessarily correlate to how it will grow.

Modularity is about limiting the options for growth

A given module often gets changed because it is the easiest module to change, rather than the best place for the change to be made. In the end, what defines a module is what pieces of the system it will never be responsible for, rather than what it is currently responsible for.

When a unit has no rules about what code cannot be included, it will eventually contain larger and larger amounts of the system. This is eternally true of every module named ‘util’, and why almost everything in a Model-View-Controller system ends up in the controller.

In theory, Model-View-Controller is about three interlocking units of code. One for the database, another for the UI, and one for the glue between them. In practice, Model-View-Controller resembles a monolith with two distinct subsystems—one for the database code, another for the UI, both nestled inside the controller.

The purpose of MVC isn’t to just keep all the database code in one place, but also to keep it away from frontend code. The data we have and how we want to view it will change over time independent of the frontend code.

Although code reuse is good and smaller components are good, they should be the result of other desired changes. Both are tradeoffs, introducing coupling through a lack of redundancy, or complexity in how things are composed. Decomposing things into smaller parts or unifying them is neither universally good nor bad for the codebase, and largely depends on what changes come afterwards.

In the same way abstraction isn’t about code reuse, but coupling things for change, modularity isn’t about grouping similar things together by function, but working out how to keep things apart and limiting co-ordination across the codebase.

This means recognizing which bits are slightly more entangled than others, knowing which pieces need to talk to each other, which need to share resources, what shares responsibilities, and most importantly, what external constraints are in place and which way they are moving.

In the end, it’s about optimizing for those changes—and this is rarely achieved by aiming for reusable code, as sometimes handling changes means rewriting everything.

Rewrite Everything

Usually, a rewrite is only a practical option when it’s the only option left. Technical debt, or code the seniors wrote that we can’t be rude about, accrues until all change becomes hazardous. It is only when the system is at breaking point that a rewrite is even considered an option.

Sometimes the reasons can be less dramatic: an API is being switched off, a startup has taken a beautiful journey, or there’s a new fashion in town and orders from the top to chase it. Rewrites can happen to appease a programmer too—rewarding good teamwork with a solo project.

The reason rewrites are so risky in practice is that replacing one working system with another is rarely an overnight change. We rarely understand what the previous system did—many of its properties are accidental in nature. Documentation is scarce, tests are ornamental, and interfaces are organic in nature, stubbornly locking behaviors in place.

If migrating to the replacement depends on switching over everything at once, make sure you’ve booked a holiday during the transition, well in advance.

Successful rewrites plan for migration to and from the old system, plan to ease in the existing load, and plan to handle things being in one or both places at once. Both systems are continuously maintained until one of them can be decommissioned. A slow, careful migration is the only option that reliably works on larger systems.

To succeed, you have to start with the hard problems first—often performance related—but it can involve dealing with the most difficult customer, or the biggest customer or user of the system, too. Rewrites must be driven by triage, reducing the problem in scope into something that can be effectively improved while being guided by the larger problems at hand.

If a replacement isn’t doing something useful after three months, odds are it will never do anything useful.

The longer it takes to run a replacement system in production, the longer it takes to find bugs. Unfortunately, migrations get pushed back in the name of feature development. A new project has the most room for feature bloat—this is known as the second-system effect.

The second-system effect is the name of the canonical doomed rewrite, one where numerous features are planned, not enough are implemented, and what has been written rarely works reliably. It’s similar to writing a game engine without a game to guide decisions, or a framework without a product inside. The resulting code is an unconstrained mess that is barely fit for its purpose.

The reason we say “Never Rewrite Code” is that we leave rewrites too late, demand too much, and expect them to work immediately. It’s more important to never rewrite in a hurry than to never rewrite at all.

null is true, everything is permitted

The problem with following advice to the letter is that it rarely works in practice. The problem with following it at all costs is that eventually we cannot afford to do so.

It isn’t “Don’t Repeat Yourself”, but “Some redundancy is healthy, some isn’t”, and using abstractions when you’re sure you want to couple things together.

It isn’t “Each thing has a unique component”, or other variants of the single responsibility principle, but “Decoupling parts into smaller pieces is often worth it if the interfaces are simple between them, and try to keep the fast changing and tricky to implement bits away from each other”.

It’s never “Don’t Rewrite!”, but “Don’t abandon what works”. Build a plan for migration, maintain in parallel, then decommission, eventually. In high-growth situations you can probably put off decommissioning, and possibly even migrations.

When you hear a piece of advice, you need to understand the structure and environment in place that made it true, because they can just as often make it false. Things like “Don’t Repeat Yourself” are about making a tradeoff, usually one that’s good in the small or for beginners to copy at first, but hazardous to invoke without question on larger systems.

In a larger system, it’s much harder to understand the consequences of our design choices—in many cases the consequences are only discovered far, far too late in the process and it is only by throwing more engineers into the pit that there is any hope of completion.

In the end, we call our good decisions ‘clean code’ and our bad decisions ‘technical debt’, despite following the same rules and practices to get there.

Write code that’s easy to delete, and easy to debug too.

Debuggable code is code that doesn’t outsmart you. Some code is a little harder to debug than other code: code with hidden behaviour, poor error handling, ambiguity, too little or too much structure, or code that’s in the middle of being changed. On a large enough project, you’ll eventually bump into code that you don’t understand.

On an old enough project, you’ll discover code you forgot about writing—and if it wasn’t for the commit logs, you’d swear it was someone else. As a project grows in size it becomes harder to remember what each piece of code does, harder still when the code doesn’t do what it is supposed to. When it comes to changing code you don’t understand, you’re forced to learn about it the hard way: Debugging.

Writing code that’s easy to debug begins with realising you won’t remember anything about the code later.

Rule 0: Good code has obvious faults.

Many used methodology salesmen have argued that the way to write understandable code is to write clean code. The problem is that “clean” is highly contextual in meaning. Clean code can be hardcoded into a system, and sometimes a dirty hack can be written in a way that’s easy to turn off. Sometimes the code is clean because the filth has been pushed elsewhere. Good code isn’t necessarily clean code.

Code being clean or dirty is more about how much pride, or embarrassment, the developer takes in the code, rather than how easy it has been to maintain or change. Instead of clean, we want boring code where change is obvious—I’ve found it easier to get people to contribute to a code base when the low hanging fruit has been left around for others to collect. The best code might be anything you can look at and quickly learn things about.

Sometimes, code is just nasty as fuck, and any attempt to clean it up leaves you in a worse state. Writing clean code without understanding the consequences of your actions might as well be a summoning ritual for maintainable code.

This is not to say that clean code is bad, but sometimes the practice of clean coding is more akin to sweeping problems under the rug. Debuggable code isn’t necessarily clean, and code that’s littered with checks or error handling rarely makes for pleasant reading.

Rule 1: The computer is always on fire.

The computer is on fire, and the program crashed the last time it ran.

The first thing a program should do is ensure that it is starting out from a known, good, safe state before trying to get any work done. Sometimes there isn’t a clean copy of the state because the user deleted it, or upgraded their computer. The program crashed the last time it ran and, rather paradoxically, the program is being run for the first time too.

For example, when reading and writing program state to a file, a number of problems can happen:

  - The file is missing, because it is the first run, or because the user deleted it.
  - The file is corrupt or half-written, because the program crashed partway through saving it.
  - The file was written by an older (or newer) version of the program.
  - Another copy of the program is writing to the same file at the same time.

These are not new problems and databases have been dealing with them since the dawn of time (1970-01-01). Using something like SQLite will handle many of these problems for you, but if the program crashed the last time it ran, the code might be run with the wrong data, or in the wrong way too.
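
As a rough sketch of what that means in practice (the file name and format here are made up, and SQLite would handle more of this for you): treat a missing or corrupt state file as the first run, and write updates to a temporary file before renaming it into place, so a crash can’t leave a half-written file behind.

    import json
    import os
    import tempfile

    STATE_FILE = "state.json"   # hypothetical location of the program's state

    def load_state():
        try:
            with open(STATE_FILE) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            # Missing or half-written file: assume we crashed last time,
            # and start again from a known good state.
            return {"version": 1, "seen": []}

    def save_state(state):
        fd, tmp = tempfile.mkstemp(dir=".", prefix="state-")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename is atomic: readers see the old file or the new one, never half of each.
        os.replace(tmp, STATE_FILE)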

With scheduled programs, for example, you can guarantee that the following accidents will occur:

  - The program will run twice at the same time, because the previous run hasn’t finished yet.
  - The program won’t run at all, and no-one will notice until something downstream breaks.
  - The program will run late, or against the wrong clock, and process the wrong window of data.

Writing robust software begins with writing software that assumes it crashed the last time it ran, and that crashes whenever it doesn’t know the right thing to do. The best thing about throwing an exception instead of leaving a comment like “This Shouldn’t Happen” is that when it inevitably does happen, you get a head-start on debugging your code.

You don’t have to be able to recover from these problems either—it’s enough to let the program give up and not make things any worse. Small checks that raise an exception can save weeks of tracing through logs, and a simple lock file can save hours of restoring from backup.
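
A minimal sketch of both ideas, with invented names: raise on a state you don’t understand rather than commenting that it shouldn’t happen, and take a simple lock file so a second copy of the job gives up instead of making things worse.

    import os
    import sys

    def apply_change(record):
        if record["status"] not in ("new", "retrying"):
            # Instead of a comment saying "this shouldn't happen", crash here,
            # where the surprise is cheapest to debug.
            raise RuntimeError(f"unexpected status: {record['status']!r}")
        ...  # do the actual work

    def main():
        try:
            # O_EXCL means creation fails if the lock file already exists.
            fd = os.open("job.lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            sys.exit("another copy is running, or crashed without cleaning up")
        try:
            apply_change({"status": "new"})
        finally:
            os.close(fd)
            os.remove("job.lock")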

Code that’s easy to debug is code that checks to see if things are correct before doing what was asked of it, code that makes it easy to go back to a known good state and try again, and code that has layers of defence to force errors to surface as early as possible.

Rule 2: Your program is at war with itself.

Google’s biggest DoS attacks come from ourselves—because we have really big systems—although every now and then someone will show up and try to give us a run for our money, but really we’re more capable of hammering ourselves into the ground than anybody else is.

This is true for all systems.

Astrid Atkinson, Engineering for the Long Game

The software always crashed the last time it ran, and now it is always out of cpu, out of memory, and out of disk too. All of the workers are hammering an empty queue, everyone is retrying a failed request that’s long expired, and all of the servers have paused for garbage collection at the same time. Not only is the system broken, it is constantly trying to break itself.

Even checking if the system is actually running can be quite difficult.

It can be quite easy to implement something that checks if the server is running, but not if it is handling requests. Unless you check the uptime, it is possible that the program is crashing in-between every check. Health checks can trigger bugs too: I have managed to write health checks that crashed the systems they were meant to protect. On two separate occasions, three months apart.
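
A sketch of the difference (the database handle is assumed): a check that exercises a real dependency and reports uptime says more than one that only proves the process can answer.

    import time

    STARTED_AT = time.monotonic()

    def healthcheck(db):
        # Touch something the real requests depend on, not just the process.
        db.execute("SELECT 1")
        # Report uptime, so a crash-and-restart loop is visible between checks.
        return {"ok": True, "uptime_seconds": time.monotonic() - STARTED_AT}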

In software, writing code to handle errors will inevitably lead to discovering more errors to handle, many of them caused by the error handling itself. Similarly, performance optimisations can often be the cause of bottlenecks in the system—making an app that’s pleasant to use in one tab can make an app that’s painful to use when you have twenty copies of it running.

Another example is where a worker in a pipeline is running too fast, and exhausting the available memory before the next part has a chance to catch up. If you’d rather a car metaphor: traffic jams. Speeding up is what creates them, and can be seen in the way the congestion moves back through the traffic. Optimisations can create systems that fail under high or heavy load, often in mysterious ways.

In other words: the faster you make it, the harder it will be pushed, and if you don’t allow your system to push back even a little, don’t be surprised if it snaps.
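
One small illustration of push-back, using Python’s standard library: with a bounded queue, a producer that outruns the consumer blocks instead of quietly eating all the memory.

    import queue
    import threading

    tasks = queue.Queue(maxsize=100)   # the bound is the back-pressure

    def consumer():
        while True:
            item = tasks.get()
            # ... slow work happens here ...
            tasks.task_done()

    threading.Thread(target=consumer, daemon=True).start()

    for i in range(10_000):
        tasks.put(i)   # blocks whenever the consumer falls 100 items behind

    tasks.join()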

Back-pressure is one form of feedback within a system, and a program that is easy to debug is one where the user is involved in the feedback loop, having insight into all behaviours of a system, the accidental, the intentional, the desired, and the unwanted too. Debuggable code is easy to inspect, where you can watch and understand the changes happening within.

Rule 3: What you don’t disambiguate now, you debug later.

In other words: it should not be hard to look at the variables in your program and work out what is happening. Give or take some terrifying linear algebra subroutines, you should strive to represent your program’s state as obviously as possible. This means things like not changing your mind about what a variable does halfway through a program; if there is one obvious cardinal sin, it is using a single variable for two different purposes.

It also means carefully avoiding the semi-predicate problem, never using a single value (count) to represent a pair of values (boolean, count). Avoiding things like returning a positive number for a result, and returning -1 when nothing matches. The reason is that it’s easy to end up in the situation where you want something like "0, but true" (and notably, Perl 5 has this exact feature), or you create code that’s hard to compose with other parts of your system (-1 might be a valid input for the next part of the program, rather than an error).
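
The same lookup written both ways, as a sketch: the -1 version is easy to misuse because -1 can flow into the next step as if it were a valid position, while None cannot be mistaken for a result.

    def find_index_semipredicate(items, target):
        for i, item in enumerate(items):
            if item == target:
                return i
        return -1          # "not found" disguised as a plausible number

    def find_index(items, target):
        for i, item in enumerate(items):
            if item == target:
                return i
        return None        # absence is its own, unmistakable value

    position = find_index(["a", "b", "c"], "d")
    if position is None:
        print("not found")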

Along with using a single variable for two purposes, it can be just as bad to use a pair of variables for a single purpose—especially if they are booleans. I don’t mean keeping a pair of numbers to store a range is bad, but using a number of booleans to indicate what state your program is in is often a state machine in disguise.

When state doesn’t flow from top to bottom, give or take the occasional loop, it’s best to give the state a variable of its own and clean the logic up. If you have a set of booleans inside an object, replace it with a variable called state and use an enum (or a string if it’s persisted somewhere). The if statements end up looking like if state == name and stop looking like if bad_name && !alternate_option.
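
A small sketch of that replacement, with made-up states: the booleans collapse into one variable, and illegal combinations stop being representable.

    from enum import Enum

    class State(Enum):
        WAITING = "waiting"
        RUNNING = "running"
        FAILED = "failed"
        DONE = "done"

    class Job:
        def __init__(self):
            # instead of self.started, self.failed, self.finished, ...
            self.state = State.WAITING

        def start(self):
            if self.state is not State.WAITING:
                raise RuntimeError(f"can't start a job in state {self.state.value}")
            self.state = State.RUNNING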

Even when you do make the state machine explicit, you can still mess up: sometimes code has two state machines hidden inside. I had great difficulty writing an HTTP proxy until I had made each state machine explicit, tracing connection state and parsing state separately. When you merge two state machines into one, it can be hard to add new states, or know exactly what state something is meant to be in.

This is far more about creating things you won’t have to debug, than making things easy to debug. By working out the list of valid states, it’s far easier to reject the invalid ones outright, rather than accidentally letting one or two through.

Rule 4: Accidental Behaviour is Expected Behaviour.

When you’re less than clear about what a data structure does, users fill in the gaps—any behaviour of your code, intended or accidental, will eventually be relied upon somewhere else. Many mainstream programming languages had hash tables you could iterate through, which sort-of preserved insertion order, most of the time.

Some languages chose to make the hash table behave as many users expected them to, iterating through the keys in the order they were added, but others chose to make the hash table return keys in a different order, each time it was iterated through. In the latter case, some users then complained that the behaviour wasn’t random enough.

Tragically, any source of randomness in your program will eventually be used for statistical simulation purposes, or worse, cryptography, and any source of ordering will be used for sorting instead.

In a database, some identifiers carry a little bit more information than others. When creating a table, a developer can choose between different types of primary key. The correct answer is a UUID, or something that’s indistinguishable from a UUID. The problem with the other choices is that they can expose ordering information as well as identity, i.e. not just if a == b but if a <= b, and by other choices I mean auto-incrementing keys.

With an auto-incrementing key, the database assigns a number to each row in the table, adding 1 when a new row is inserted. This creates an ambiguity of sorts: people do not know which part of the data is canonical. In other words: do you sort by key, or by timestamp? As with the hash tables before, people will decide the right answer for themselves. The other problem is that users can easily guess the keys of other records nearby, too.

Ultimately any attempt to be smarter than a UUID will backfire: we already tried with postcodes, telephone numbers, and IP Addresses, and we failed miserably each time. UUIDs might not make your code more debuggable, but less accidental behaviour tends to mean less accidents.
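
As a sketch with sqlite3 (the table and columns are invented): the application generates the key, so it carries identity and nothing else, and anyone who wants an ordering has to say which ordering they mean.

    import sqlite3
    import uuid

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, created_at TEXT, total INTEGER)")

    def insert_order(total, created_at):
        order_id = str(uuid.uuid4())   # no ordering, no guessable neighbours
        db.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, created_at, total))
        return order_id

    insert_order(100, "2017-01-01T00:00:00Z")
    insert_order(250, "2017-01-02T00:00:00Z")

    # Sort by the timestamp you actually mean, never by the key.
    for row in db.execute("SELECT id, total FROM orders ORDER BY created_at"):
        print(row)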

Ordering is not the only piece of information people will extract from a key: If you create database keys that are constructed from the other fields, then people will throw away the data and reconstruct it from the key instead. Now you have two problems: when a program’s state is kept in more than one place, it is all too easy for the copies to start disagreeing with each other. It’s even harder to keep them in sync if you aren’t sure which one you need to change, or which one you have changed.

Whatever you permit your users to do, they’ll implement. Writing debuggable code is thinking ahead about the ways in which it can be misused, and how other people might interact with it in general.

Rule 5: Debugging is social, before it is technical.

When a software project is split over multiple components and systems, it can be considerably harder to find bugs. Once you understand how the problem occurs, you might have to co-ordinate changes across several parts in order to fix the behaviour. Fixing bugs in a larger project is less about finding the bugs, and more about convincing the other people that they’re real, or even that a fix is possible.

Bugs stick around in software because no-one is entirely sure who is responsible for things. In other words, it’s harder to debug code when nothing is written down, everything must be asked in Slack, and nothing gets answered until the one person who knows logs-on.

Planning, tools, process, and documentation are the ways we can fix this.

Planning is how we remove the stress of being on call, putting structures in place to manage incidents. Plans are how we keep customers informed, switch out people when they’ve been on call too long, and how we track the problems and introduce changes to reduce future risk. Tools are the way in which we deskill work and make it accessible to others. Process is the way in which we can remove control from the individual and give it to the team.

The people will change, the interactions too, but the processes and tools will be carried on as the team mutates over time. It isn’t so much valuing one more than the other but building one to support changes in the other. Process can also be used to remove control from the team too, so it isn’t always good or bad, but there is always some process at work, even when it isn’t written down, and the act of documenting it is the first step to letting other people change it.

Documentation means more than text files: documentation is how you handover responsibilities, how you bring new people up to speed, and how you communicate what’s changed to the people impacted by those changes. Writing documentation requires more empathy than writing code, and more skill too: there aren’t easy compiler flags or type checkers, and it’s easy to write a lot of words without documenting anything.

Without documentation, how can you expect people to make informed decisions, or even consent to the consequences of using the software? Without documentation, tools, or processes you cannot share the burden of maintenance, or even replace the people currently lumbered with the task.

Making things easy to debug applies just as much to the processes around code as the code itself, making it clear whose toes you will have to stand on to fix the code.

Code that’s easy to debug is easy to explain.

A common occurrence when debugging is realising the problem when explaining it to someone else. The other person doesn’t even have to exist but you do have to force yourself to start from scratch, explain the situation, the problem, the steps to reproduce it, and often that framing is enough to give us insight into the answer.

If only. Sometimes when we ask for help, we don’t ask for the right help, and I’m as guilty of this as anyone—it’s such a common affliction that it has a name: “The X-Y Problem”: How do I get the last three letters of a filename? Oh? No, I meant the file extension.

We talk about problems in terms of the solutions we understand, and we talk about the solutions in terms of the consequences we’re aware of. Debugging is learning the hard way about unexpected consequences and alternative solutions, and it involves one of the hardest things a programmer can ever do: admit that they got something wrong.

It wasn’t a compiler bug, after all.

Psychological Safety in Operation Teams (usenix.org)

Think of a team you work with closely. How strongly do you agree with these five statements?

  1. If I take a chance and screw up, it will be held against me.
  2. Our team has a strong sense of culture that can be hard for new people to join.
  3. My team is slow to offer help to people who are struggling.
  4. Using my unique skills and talents comes second to the objectives of the team.
  5. It’s uncomfortable to have open, honest conversations about our team’s sensitive issues.

Teams that score high on questions like these can be deemed to be “unsafe.”

How do you cut a monolith in half?

It depends.

The problem with distributed systems is that no matter what the question is, the answer is inevitably ‘It Depends’.

When you cut a larger service apart, where you cut depends on latency, resources, and access to state, but it also depends on error handling, availability, and recovery processes. It depends, but you probably don’t want to depend on a message broker.

Using a message broker to distribute work is like a cross between a load balancer and a database, with the disadvantages of both and the advantages of neither.

Message brokers, or persistent queues accessed by publish-subscribe, are a popular way to pull components apart over a network. They’re popular because they often have a low setup cost, and provide easy service discovery, but they can come at a high operational cost, depending where you put them in your systems.

In practice, a message broker is a service that transforms network errors and machine failures into filled disks. Then you add more disks. The advantage of publish-subscribe is that it isolates components from each other, but the problem is usually gluing them together.


For short-lived tasks, you want a load balancer

For short-lived tasks, publish-subscribe is a convenient way to build a system quickly, but you inevitably end up implementing a new protocol atop. You have publish-subscribe, but you really want request-response. If you want something computed, you’ll probably want to know the result.

Starting with publish-subscribe makes work assignment easy: jobs get added to the queue, workers take turns to remove them. Unfortunately, it makes finding out what happened quite hard, and you’ll need to add another queue to send a result back.

Once you can handle success, it is time to handle the errors. The first step is often adding code to retry the request a few times. After you DDoS your system, you put a call to sleep(). After you slowly DDoS your system, each retry waits twice as long as the previous.

(Aside: Accidental synchronisation is still a problem, as waiting to retry doesn’t prevent a lot of things happening at once.)
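
A sketch of that retry ladder (the request function and its failure mode are stand-ins): exponential backoff spaces the retries out, and the random jitter is one way to blunt the accidental synchronisation mentioned in the aside.

    import random
    import time

    def call_with_retries(make_request, attempts=5, base_delay=0.5):
        for attempt in range(attempts):
            try:
                return make_request()
            except ConnectionError:
                if attempt == attempts - 1:
                    raise                                      # give up, let the caller decide
                delay = base_delay * (2 ** attempt)            # wait twice as long each time
                time.sleep(delay * random.uniform(0.5, 1.5))   # jitter, so clients spread out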

As workers fail to keep up, clients give up and retry work, but the earlier request is still waiting to be processed. The solution is to move some of the queue back to clients, asking them to hold onto work until work has been accepted: back-pressure, or acknowledgements.

Although the components interact via publish-subscribe, we’ve created a request-response protocol atop. Now the message broker is really only doing two useful things: service discovery, and load balancing. It is also doing two not-so-useful things: enqueuing requests, and persisting them.

For short-lived tasks, the persistence is unnecessary: the client sticks around for as long as the work needs to be done, and handles recovery. The queuing isn’t that necessary either.

Queues inevitably run in two states: full, or empty. If your queue is running full, you haven’t pushed enough work to the edges, and if it is running empty, it’s working as a slow load balancer.

A mostly empty queue is still first-come-first-served, serving as a point of contention for requests. A broker often does nothing but wait for workers to poll for new messages. If your queue is meant to run empty, why wait to forward on a request?

(Aside: Something like random load balancing will work, but join-idle-queue is well worth your time investigating)

For distributing short-lived tasks, you can use a message broker, but you’ll be building a load balancer, along with an ad-hoc RPC system, with extra latency.


For long-lived tasks, you’ll need a database

A load balancer with service discovery won’t help you with long-running tasks, work that outlives the client, or managing throughput. You’ll want persistence, but not in your message broker. For long-lived tasks, you’ll want a database instead.

Although the persistence and queueing were obstacles for short-lived tasks, the disadvantages are less obvious for long-lived tasks, but similar things can go wrong.

If you care about the result of a task, you’ll want to store that it is needed somewhere other than in the persistent queue. If the task is run but fails midway, something will have to take responsibility for it, and the broker will have forgotten. This is why you want a database.

Duplicates in a queue often cause more headaches, as long-lived tasks have more opportunities to overlap. Although we’re using the broker to distribute work, we’re also using it implicitly as a mutex. To stop work from overlapping, you implement a lock atop. After it breaks a couple of times, you replace it with leases, adding timeouts.

(Note: This is not why you want a database, using transactions for long running tasks is suffering. Long running processes are best modelled as state machines.)
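
A rough sketch of a lease, assuming a tasks table with owner and lease_expires columns (names invented here): a worker claims a task only if nobody holds a live lease on it, and a crashed worker’s claim simply expires instead of wedging the system.

    import time

    LEASE_SECONDS = 300

    def try_claim(db, task_id, worker_id):
        now = time.time()
        cursor = db.execute(
            "UPDATE tasks SET owner = ?, lease_expires = ? "
            "WHERE id = ? AND (owner IS NULL OR lease_expires < ?)",
            (worker_id, now + LEASE_SECONDS, task_id, now),
        )
        return cursor.rowcount == 1   # zero rows updated means someone else holds the lease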

When the database becomes the primary source of truth, you can handle a broker going offline, or a broker losing the contents of a queue, by backfilling from the database. As a result, you don’t need to directly enqueue work with the broker, but mark it as required in the database, and wait for something else to handle it.

Assuming that something else isn’t a human who has been paged.

A message pump can scan the database periodically and send work requests to the broker. Enqueuing work in batches can be an effective way of making an expensive database call survivable. The pump responsible for enqueuing the work can also track if it has completed, and so handle recovery or retries too.
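
A sketch of such a pump, with invented table, state names, and broker API: the database stays the source of truth, and the broker only ever sees work that still needs doing.

    import time

    def pump(db, broker, batch_size=100, interval=30):
        while True:
            rows = db.execute(
                "SELECT id FROM tasks WHERE state = 'pending' "
                "ORDER BY created_at LIMIT ?",
                (batch_size,),
            ).fetchall()
            for (task_id,) in rows:
                broker.publish("work", task_id)   # hypothetical broker client
                db.execute("UPDATE tasks SET state = 'queued' WHERE id = ?", (task_id,))
            time.sleep(interval)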

Backlog is still a problem, so you’ll want to use back-pressure to keep the queue fairly empty, and only fill from the database when needed. Although a broker can handle temporary overload, back-pressure should mean it never has to.

At this point the message broker is really providing two things: service discovery, and work assignment, but really you need a scheduler. A scheduler is what scans a database, works out which jobs need to run, and often where to run them too. A scheduler is what takes responsibility for handling errors.

(Aside: Writing a scheduler is hard. It is much easier to have 1000 while loops waiting for the right time, than one while loop waiting for which of the 1000 is first. A scheduler can track when it last ran something, but the work can’t rely on that being the last time it ran. Idempotency isn’t just your friend, it is your saviour.)

You can use a message broker for long-lived tasks, but you’ll be building a lock manager, a database, and a scheduler, along with yet another home-brew request-response system.


Publish-Subscribe is about isolating components

The problem with running tasks with publish-subscribe is that you really want request-response. The problem with using queues to assign work is that you don’t want to wait for a worker to ask.

The problem with relying on a persistent queue for recovery, is that recovery must get handled elsewhere, and the problem with brokers is nothing else makes service discovery so trivial.

Message brokers can be misused, but that isn’t to say they have no use. Brokers work well when you need to cross system boundaries.

Although you want to keep queues empty between components, it is convenient to have a buffer at the edges of your system, to hide some failures from external clients. When you handle external faults at the edges, you free the insides from handling them. The inside of your system can focus on handling internal problems, of which there are many.

A broker can be used to buffer work at the edges, but it can also be used as an optimisation, to kick off work a little earlier than planned. A broker can pass on a notification that data has been changed, and the system can fetch data through another API.

(Aside: If you use a broker to speed up a process, the system will grow to rely on it for performance. People use caches to speed up database calls, but there are many systems that simply do not work fast enough until the cache is warmed up, filled with data. Although you are not relying on the message broker for reliability, relying on it for performance is just as treacherous.)

Sometimes you want a load balancer, sometimes you’ll need a database, but sometimes a message broker will be a good fit.

Although persistence can’t handle many errors, it is convenient if you need to restart with new code or settings, without data loss. Sometimes the error handling offered is just right.

Although a persistent queue offers some protection against failure, it can’t take responsibility for when things go wrong halfway through a task. To be able to recover from failure you have to stop hiding it, you must add acknowledgements, back-pressure, error handling, to get back to a working system.

A persistent message queue is not bad in itself, but relying on it for recovery, and by extension, correct behaviour, is fraught with peril.


Systems grow by pushing responsibilities to the edges

Performance isn’t easy either. You don’t want queues, or persistence in the central or underlying layers of your system. You want them at the edges.

“It’s slow” is the hardest problem to debug, and often the reason is that something is stuck in a queue. For long and short-lived tasks, we used back-pressure to keep the queue empty, to reduce latency.

When you have several queues between you and the worker, it becomes even more important to keep the queue out of the centre of the network. We’ve spent decades on tcp congestion control to avoid it.

If you’re curious, the history of tcp congestion makes for interesting reading. Although the ends of a tcp connection were responsible for failure and retries, the routers were responsible for congestion: drop things when there is too much.

The problem is that it worked until the network was saturated, and similar to backlog in queues, when it broke, errors cascaded. The solution was similar: back-pressure. Similar to sleeping twice as long on errors, tcp sends half as many packets, before gradually increasing the amount as things improve.

Back-pressure is about pushing work to the edges, letting the ends of the conversation find stability, rather than trying to optimise all of the links in-between in isolation. Congestion control is about using back-pressure to keep the queues in-between as empty as possible, to keep latency down, and to increase throughput by avoiding the need to drop packets.

Pushing work to the edges is how your system scales. We have spent a lot of time and a considerable amount of money on IP-Multicast, but nothing has been as effective as BitTorrent. Instead of relying on smart routers to work out how to broadcast, we rely on smart clients to talk to each other.

Pushing recovery to the outer layers is how your system handles failure. In the earlier examples, we needed to get the client, or the scheduler to handle the lifecycle of a task, as it outlived the time on the queue.

Error recovery in the lower layers of a system is an optimisation, and you can’t push work to the centre of a network and scale. This is the end-to-end principle, and it is one of the most important ideas in system design.

The end-to-end principle is why you can restart your home router, when it crashes, without it having to replay all of the websites you wanted to visit before letting you ask for a new page. The browser (and your computer) is responsible for recovery, not the computers in between.

This isn’t a new idea, and Erlang/OTP owes a lot to it. OTP organises a running program into a supervision tree. Each process will often have one process above it, restarting it on failure, and above that, another supervisor to do the same.

(Aside: Pipelines aren’t incompatible with process supervision, one way is for each part to spawn the program that reads its output. A failure down the chain can propagate back up to be handled correctly.)

Although each program will handle some errors, the top levels of the supervision tree handle larger faults with restarts. Similarly, it’s nice if your webpage can recover from a fault, but inevitably someone will have to hit refresh.

The end-to-end principle is realising that no matter how many exceptions you handle deep down inside your program, some will leak out, and something at the outer layer has to take responsibility.

Although sometimes taking responsibility is writing things to an audit log, and message brokers are pretty good at that.


Aside: But what about replicated logs?

“How do I subscribe to the topic on the message broker?”

“It’s not a message broker, it’s a replicated log”

“Ok, How do I subscribe to the replicated log”

From ‘I believe I did, Bob’, jrecursive

Although a replicated log is often confused with a message broker, they aren’t immune from handling failure. Although it’s good the components are isolated from each other, they still have to be integrated into the system at large. Both offer a one way stream for sharing, both offer publish-subscribe like interfaces, but the intent is wildly different.

A replicated log is often about auditing, or recovery: having a central point of truth for decisions. Sometimes a replicated log is about building a pipeline with fan-in (aggregating data), or fan-out (broadcasting data), but always building a system where data flows in one direction.

The easiest way to see the difference between a replicated log and a message broker is to ask an engineer to draw a diagram of how the pieces connect.

If the diagram looks like a one-way system, it’s a replicated log. If almost every component talks to it, it’s a message broker. If you can draw a flow-chart, it’s a replicated log. If you take all the arrows away and you’re left with a venn diagram of ‘things that talk to each other’, it’s a message broker.

Be warned: A distributed system is something you can draw on a whiteboard pretty quickly, but it’ll take hours to explain how all the pieces interact.


You cut a monolith with a protocol

How you cut a monolith is often more about how you are cutting up responsibility within a team, than cutting it into components. It really does depend, and often more on the social aspects than the technical ones, but you are still responsible for the protocol you create.

Distributed systems are messy because of how the pieces interact over time, rather than which pieces are interacting. The complexity of a distributed system does not come from having hundreds of machines, but hundreds of ways for them to interact. A protocol must take into account performance, safety, stability, availability, and most importantly, error handling.

When we talk about distributed systems, we are talking about power structures: how resources are allocated, how work is divided, how control is shared, or how order is kept across systems ostensibly built out of well meaning but faulty components.

A protocol is the rules and expectations of participants in a system, and how they are beholden to each other. A protocol defines who takes responsibility for failure.

The problem with message brokers, and queues, is that no-one does.

Using a message broker is not the end of the world, nor a sign of poor engineering. Using a message broker is a tradeoff. Use them freely knowing they work well on the edges of your system as buffers. Use them wisely knowing that the buck has to stop somewhere else. Use them cheekily to get something working.

I say don’t rely on a message broker, but I can’t point to easy off-the-shelf answers. HTTP and DNS are remarkable protocols, but I still have no good answers for service discovery.

Lots of software regularly gets pushed into service way outside of its designed capabilities, and brokers are no exception. Although the bad habits around brokers and the relative ease of getting a prototype up and running lead to nasty effects at scale, you don’t need to build everything at once.

The complexity of a system lies in its protocol not its topology, and a protocol is what you create when you cut your monolith into pieces. If modularity is about building software, protocol is about how we break it apart.

The main task of the engineering analyst is not merely to obtain “solutions” but is rather to understand the dynamic behaviour of the system in such a way that the secrets of the mechanism are revealed, and that if it is built it will have no surprises left for [them]. Other than exhaustive physical experimentations, this is the only sound basis for engineering design, and disregard of this cardinal principle has not infrequently led to disaster.

From “Analysis of Nonlinear Control Systems” by Dustan Graham and Duane McRuer, p 436

Protocol is the reason why ‘it depends’, and the reason why you shouldn’t depend on a message broker: You can use a message broker to glue systems together, but never use one to cut systems apart.

I like this talk a lot: what modularity is, what we use it for, how modularity happens in systems, and how we can use modularity to manage change.

RIP, Mathie.

Last night I found out I’d lost a friend, and if you’ll be patient with my words, I’d like to reflect a little.

Mathie was one of the older, weirder geeks I met. I’d escaped my home town on the edge of nowhere, and it was my first time having a peer group of adults.

He’d helped everywhere. With the student-run shell server, with the local IRC server everyone collected on, a known and friendly face on the circuit.

Mathie was one of the many people behind Scottish Ruby Conference, responsible for bringing a lot of interesting people into Edinburgh, and into my life.

Why wasn’t this talk given at Scottish Ruby Conference?

I fucked up

In front of everyone assembled at the fringe track, a collection of talks that didn’t quite make it, mathie answered honestly. It’s kinda how I’ll remember him: a bit of a fuckup.

A fuckup who changed my life for the better.

Thanks mathie, I hope to pass on some of your kindness.

RIP, You fucking idiot.

PapersWeLove London: End-to-End Arguments In System Design

This week I gave a short talk on a paper I love: End-to-End Arguments in System Design

The talk was recorded and uploaded (but not captioned), and you can watch it here: https://skillsmatter.com/skillscasts/8200-end-to-end-arguments-in-system-design-by-saltzer-reed-and-clark

A million things to do with a computer!

I gave a talk at !!con last weekend, about my favourite programming language, Scratch:

Back in 1971, Cynthia Solomon and Seymour Papert wrote “Twenty things to do with a computer”, about their experiences of teaching children to use Logo and their ideas for the future.

They were wrong: There’s a lot more than twenty. Logo’s successor, Scratch, has over thirteen million things that children and adults alike have built. Scratch is radically approachable in a way that puts every other language to shame.

This talk is about the history, present, and future of Scratch: why Scratch is about ‘coding to learn’, and not about ‘learning to code’.

I had an incredible time at !!con. The live captioning was fantastic (and they’re crowdfunding a game to teach steno too).

The livestreams are up (but no captions), and my talk is 3h29m32s in on day 2.

Addendum: Write code that is easy to delete, not easy to extend.

I found two translations by accident. I can’t tell if they are perfect translations but I am thankful nonetheless.

(Many people mentioned The Wrong Abstraction, and it is worth mentioning here too.)

Write code that is easy to delete, not easy to extend.

“Every line of code is written without reason, maintained out of weakness, and deleted by chance” Jean-Paul Sartre’s Programming in ANSI C.

Every line of code written comes at a price: maintenance. To avoid paying for a lot of code, we build reusable software. The problem with code re-use is that it gets in the way of changing your mind later on.

The more consumers of an API you have, the more code you must rewrite to introduce changes. Similarly, the more you rely on a third-party API, the more you suffer when it changes. Managing how the code fits together, or which parts depend on others, is a significant problem in large scale systems, and it gets harder as your project grows older.

My point today is that, if we wish to count lines of code, we should not regard them as “lines produced” but as “lines spent” EWD 1036

If we see ‘lines of code’ as ‘lines spent’, then when we delete lines of code, we are lowering the cost of maintenance. Instead of building re-usable software, we should try to build disposable software.

I don’t need to tell you that deleting code is more fun than writing it.

To write code that’s easy to delete: repeat yourself to avoid creating dependencies, but don’t repeat yourself to manage them. Layer your code too: build simple-to-use APIs out of simpler-to-implement but clumsy-to-use parts. Split your code: isolate the hard-to-write and the likely-to-change parts from the rest of the code, and each other. Don’t hard code every choice, and maybe allow changing a few at runtime. Don’t try to do all of these things at the same time, and maybe don’t write so much code in the first place.


Step 0: Don’t write code

The number of lines of code doesn’t tell us much on its own, but the magnitude does: 50, 500, 5,000, 10,000, 25,000, etc. A million-line monolith is going to be more annoying than a ten-thousand-line one, and take significantly more time, money, and effort to replace.

Although the more code you have the harder it is to get rid of, saving one line of code saves absolutely nothing on its own.

Even so, the easiest code to delete is the code you avoided writing in the first place.


Step 1: Copy-paste code

Building reusable code is something that’s easier to do in hindsight, with a couple of examples of use in the code base, than in foresight of ones you might want later. On the plus side, you’re probably re-using a lot of code already just by using the file-system, so why worry that much? A little redundancy is healthy.

It’s good to copy-paste code a couple of times, rather than making a library function, just to get a handle on how it will be used. Once you make something a shared API, you make it harder to change.

The code that calls your function will rely on both the intentional and the unintentional behaviours of the implementation behind it. The programmers using your function will not rely on what you document, but what they observe.

It’s simpler to delete the code inside a function than it is to delete a function.


Step 2: Don’t copy paste code

When you’ve copy and pasted something enough times, maybe it’s time to pull it up to a function. This is the “save me from my standard library” stuff: the “open a config file and give me a hash table”, “delete this directory”. This includes functions without any state, or functions with a little bit of global knowledge like environment variables. The stuff that ends up in a file called “util”.
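
The sort of thing that ends up there, sketched out: “open a config file and give me a hash table”, pulled into a function only after it had been copy-pasted a few times.

    import json

    def load_config(path, defaults=None):
        config = dict(defaults or {})
        try:
            with open(path) as f:
                config.update(json.load(f))
        except FileNotFoundError:
            pass   # a missing file just means "use the defaults"
        return config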

Aside: Make a util directory and keep different utilities in different files. A single util file will always grow until it is too big and yet too hard to split apart. Using a single util file is unhygienic.

The less specific the code is to your application or project, the easier it is to re-use and the less likely it is to change or be deleted. Library code like logging, or third party APIs, file handles, or processes. Other good examples of code you’re not going to delete are lists, hash tables, and other collections. Not because they often have very simple interfaces, but because they don’t grow in scope over time.

Instead of making code easy-to-delete, we are trying to keep the hard-to-delete parts as far away as possible from the easy-to-delete parts.


Step 3: Write more boilerplate

Despite writing libraries to avoid copy pasting, we often end up writing a lot more code through copy paste to use them, but we give it a different name: boilerplate. Boilerplate is a lot like copy-pasting, but you change some of the code in a different place each time, rather than the same bit over and over.

Like with copy paste, we are duplicating parts of code to avoid introducing dependencies, gain flexibility, and pay for it in verbosity.

Libraries that require boilerplate are often stuff like network protocols, wire formats, or parsing kits, stuff where it’s hard to interweave policy (what a program should do), and protocol (what a program can do) together without limiting the options. This code is hard to delete: it’s often a requirement for talking to another computer or handling different files, and the last thing we want to do is litter it with business logic.

This is not an exercise in code reuse: we’re trying to keep the parts that change frequently away from the parts that are relatively static. Minimising the dependencies or responsibilities of library code, even if we have to write boilerplate to use it.

You are writing more lines of code, but you are writing those lines of code in the easy-to-delete parts.


Step 4: Don’t write boilerplate

Boilerplate works best when libraries are expected to cater to all tastes, but sometimes there is just too much duplication. It’s time to wrap your flexible library with one that has opinions on policy, workflow, and state. Building simple-to-use APIs is about turning your boilerplate into a library.

This isn’t as uncommon as you might think: One of the most popular and beloved python http clients, requests, is a successful example of providing a simpler interface, powered by a more verbose-to-use library urllib3 underneath. requests caters to common workflows when using http, and hides many practical details from the user. Meanwhile, urllib3 does the pipelining, connection management, and does not hide anything from the user.

It is not so much that we are hiding detail when we wrap one library in another, but we are separating concerns: requests is about popular http adventures, urllib3 is about giving you the tools to choose your own adventure.

I’m not advocating you go out and create a /protocol/ and a /policy/ directory, but you do want to try and keep your util directory free of business logic, and build simpler-to-use libraries on top of simpler-to-implement ones. You don’t have to finish writing one library to start writing another atop.

It’s often good to wrap third party libraries too, even if they aren’t protocol-esque. You can build a library that suits your code, rather than lock in your choice across the project. Building a pleasant to use API and building an extensible API are often at odds with each other.
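
For instance, a thin opinionated wrapper over urllib3 (a sketch, not how requests is built): the wrapper bakes in timeouts, retries, and JSON decoding, and the verbose layer underneath stays free to serve everyone else.

    import json
    import urllib3

    _http = urllib3.PoolManager()

    def get_json(url, timeout=5.0, retries=3):
        # Opinions live here: sensible timeout, a few retries, JSON expected.
        response = _http.request("GET", url, timeout=timeout, retries=retries)
        if response.status >= 400:
            raise RuntimeError(f"GET {url} failed with status {response.status}")
        return json.loads(response.data.decode("utf-8"))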

This split of concerns allows us to make some users happy without making things impossible for other users. Layering is easiest when you start with a good API, but writing a good API on top of a bad one is unpleasantly hard. Good APIs are designed with empathy for the programmers who will use them, and layering is realising we can’t please everyone at once.

Layering is less about writing code we can delete later, and more about making the hard-to-delete code pleasant to use (without contaminating it with business logic).


Step 5: Write a big lump of code

You’ve copy-pasted, you’ve refactored, you’ve layered, you’ve composed, but the code still has to do something at the end of the day. Sometimes it’s best just to give up and write a substantial amount of trashy code to hold the rest together.

Business logic is code characterised by a never ending series of edge cases and quick and dirty hacks. This is fine. I am ok with this. Other styles like ‘game code’, or ‘founder code’ are the same thing: cutting corners to save a considerable amount of time.

The reason? Sometimes it’s easier to delete one big mistake than to delete 18 smaller interleaved mistakes. A lot of programming is exploratory, and it’s quicker to get it wrong a few times and iterate than to try to get it right the first time.

This is especially true of more fun or creative endeavours. If you’re writing your first game: don’t write an engine. Similarly, don’t write a web framework before writing an application. Go and write a mess the first time. Unless you’re psychic you won’t know how to split it up.

Monorepos are a similar tradeoff: You won’t know how to split up your code in advance, and frankly one large mistake is easier to deploy than 20 tightly coupled ones.

When you know what code is going to be abandoned soon, deleted, or easily replaced, you can cut a lot more corners. Especially if you make one-off client sites or event web pages: anything where you have a template and stamp out copies, or where you fill in the gaps left by a framework.

I’m not suggesting you write the same ball of mud ten times over, perfecting your mistakes. To quote Perlis: “Everything should be built top-down, except the first time”. You should be trying to make new mistakes each time, take new risks, and slowly build up through iteration.

Becoming a professional software developer is accumulating a back-catalogue of regrets and mistakes. You learn nothing from success. It is not that you know what good code looks like, but the scars of bad code are fresh in your mind.

Projects either fail or become legacy code eventually anyway. Failure happens more than success. It’s quicker to write ten big balls of mud and see where it gets you than try to polish a single turd.

It’s easier to delete all of the code than to delete it piecewise.


Step 6: Break your code into pieces

Big balls of mud are the easiest to build but the most expensive to maintain. What feels like a simple change ends up touching almost every part of the code base in an ad-hoc fashion. What was easy to delete as a whole is now impossible to delete piecewise.

In the same way that we have layered our code to separate responsibilities, from platform-specific to domain-specific, we need to find a means to tease apart the logic atop.

[Start] with a list of difficult design decisions or design decisions which are likely to change. Each module is then designed to hide such a decision from the others. D. Parnas

Instead of breaking code into parts with common functionality, we break code apart by what it does not share with the rest. We isolate the most frustrating parts to write, maintain, or delete away from each other.

We are not building modules around being able to re-use them, but being able to change them.

Unfortunately, some problems are more intertwined and harder to separate than others. Although the single responsibility principle suggests that ‘each module should only handle one hard problem’, it is more important that ‘each hard problem is only handled by one module’.

When a module does two things, it is usually because changing one part requires changing the other. It is often easier to have one awful component with a simple interface, than two components requiring a careful co-ordination between them.

I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description [”loose coupling”], and perhaps I could never succeed in intelligibly doing so. But I know it when I see it, and the code base involved in this case is not that. SCOTUS Justice Stewart

A system where you can delete parts without rewriting others is often called loosely coupled, but it’s a lot easier to explain what one looks like rather than how to build it in the first place.

Even hardcoding a value in just one place can be loose coupling, as can using a command line flag instead of a hardcoded variable. Loose coupling is about being able to change your mind without changing too much code.
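As a tiny illustration (the flag name and default here are made up), surfacing a value as a command line flag means changing your mind doesn’t mean editing the code that uses it:

    # A value we might want to change later, exposed as a flag rather than
    # buried in the code that uses it.
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--timeout", type=float, default=5.0,
                        help="seconds to wait before giving up")
    args = parser.parse_args()

    print(f"using a timeout of {args.timeout} seconds")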

For example, Microsoft Windows has internal and external APIs for this very purpose. The external APIs are tied to the lifecycle of desktop programs, and the internal API is tied to the underlying kernel. Hiding these APIs away gives Microsoft flexibility without breaking too much software in the process.

HTTP has examples of loose coupling too: Putting a cache in front of your HTTP server. Moving your images to a CDN and just changing the links to them. Neither breaks the browser.

HTTP’s error codes are another example of loose coupling: common problems across web servers have unique codes. When you get a 400 error, doing it again will get the same result; a 500 may change. As a result, HTTP clients can handle many errors on the programmer’s behalf.
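A hand-rolled sketch of the kind of handling those codes make possible (the URL and attempt count are placeholders): client errors won’t change on retry, server errors might.

    import time
    import requests

    def get_with_retries(url: str, attempts: int = 3) -> requests.Response:
        for attempt in range(attempts):
            response = requests.get(url, timeout=5)
            if response.status_code < 500:
                return response          # success, or a 4xx we have to own
            time.sleep(2 ** attempt)     # back off, then try the server again
        return response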

How your software handles failure must be taken into account when decomposing it into smaller pieces. Doing so is easier said than done.

I have decided, reluctantly, to use LaTeX. Making reliable distributed systems in the presence of software errors, Armstrong, 2003

Erlang/OTP is relatively unique in how it chooses to handle failure: supervision trees. Roughly, each process in an Erlang system is started by and watched by a supervisor. When a process encounters a problem, it exits. When a process exits, it is restarted by the supervisor.

(These supervisors are started by a bootstrap process, and when a supervisor encounters a fault, it is restarted by the bootstrap process)

The key idea is that it is quicker to fail-fast and restart than it is to handle errors. Error handling like this may seem counter-intuitive, gaining reliability by giving up when errors happen, but turning things off-and-on again has a knack for suppressing transient faults.
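As a loose analogy only, and not how Erlang/OTP is implemented, a supervisor loop sketched in Python might look like this: the worker crashes, the supervisor restarts it. The worker body and restart limit are placeholders.

    import multiprocessing
    import time

    def worker():
        pass  # do some work; on an unexpected error, just crash

    def supervise(target, max_restarts: int = 5):
        for _ in range(max_restarts):
            child = multiprocessing.Process(target=target)
            child.start()
            child.join()
            if child.exitcode == 0:
                return           # clean exit, nothing left to do
            time.sleep(1)        # brief pause, then turn it off and on again

    if __name__ == "__main__":
        supervise(worker)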

Error handling and recovery are best done at the outer layers of your code base. This is known as the end-to-end principle. The end-to-end principle argues that it is easier to handle failure at the far ends of a connection than anywhere in the middle. If you have any handling inside, you still have to do the final top-level check; if the layer atop has to handle errors anyway, why bother handling them on the inside too?

Error handling is one of the many ways in which a system can be tightly bound together. There are many other examples of tight coupling, but it is a little unfair to single one out as being badly designed. Except for IMAP.

In IMAP almost every operation is a snowflake, with unique options and handling. Error handling is painful: errors can come back halfway through the result of another operation.

Instead of UUIDs, IMAP generates unique tokens to identify each message. These can change halfway through the result of an operation too. Many operations are not atomic. It took more than 25 years to get a way to move email from one folder to another that reliably works. There is a special UTF-7 encoding, and a unique base64 encoding too.

I am not making any of this up.

By comparison, both file systems and databases make much better examples of remote storage. With a file system, you have a fixed set of operations, but a multitude of objects you can operate on.

Although SQL may seem like a much broader interface than a filesystem, it follows the same pattern: a number of operations on sets, and a multitude of rows to operate on. Although you can’t always swap out one database for another, it is easier to find something that works with SQL than with any homebrew query language.
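A small illustration of that shape, using the standard library’s sqlite3: the same few verbs work against whichever tables you happen to have. (Interpolating the table name into the query is for illustration only.)

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.execute("CREATE TABLE orders (total REAL)")

    conn.execute("INSERT INTO users VALUES (?)", ("ada",))
    conn.execute("INSERT INTO orders VALUES (?)", (9.99,))

    # The same SELECT works for any table; the objects multiply,
    # the operations stay fixed.
    for table in ("users", "orders"):
        print(table, conn.execute(f"SELECT * FROM {table}").fetchall())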

Other examples of loose coupling are systems with middleware, or filters and pipelines. For example, Twitter’s Finagle uses a common API for services, and this allows generic timeout handling, retry mechanisms, and authentication checks to be added effortlessly to client and server code.

(I’m sure if I didn’t mention the UNIX pipeline here someone would complain at me)

First we layered our code, but now some of those layers share an interface: a common set of behaviours and operations with a variety of implementations. Good examples of loose coupling are often examples of uniform interfaces.
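A minimal sketch of a uniform interface, loosely in the spirit of the service-as-a-function idea (the names below are made up, not Finagle’s API): every service maps a request to a response, so generic wrappers compose around any of them.

    import time

    def with_timing(service):
        def wrapped(request):
            started = time.monotonic()
            try:
                return service(request)
            finally:
                print(f"{request!r} took {time.monotonic() - started:.3f}s")
        return wrapped

    def with_retries(service, attempts=3):
        def wrapped(request):
            for attempt in range(attempts):
                try:
                    return service(request)
                except Exception:
                    if attempt == attempts - 1:
                        raise
        return wrapped

    def hello(request):
        return f"hello, {request}"

    service = with_timing(with_retries(hello))  # same shape at every layer
    print(service("world"))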

A healthy code base doesn’t have to be perfectly modular. The modular bit makes it way more fun to write code, in the same way that Lego bricks are fun because they all fit together. A healthy code base has some verbosity, some redundancy, and just enough distance between the moving parts so you won’t trap your hands inside.

Code that is loosely coupled isn’t necessarily easy-to-delete, but it is much easier to replace, and much easier to change too.


Step 7: Keep writing code

Being able to write new code without dealing with old code makes it far easier to experiment with new ideas. It isn’t so much that you should write microservices and not monoliths, but your system should be capable of supporting one or two experiments atop while you work out what you’re doing.

Feature flags are one way to change your mind later. Although feature flags are seen as ways to experiment with features, they allow you to deploy changes without re-deploying your software.

Google Chrome is a spectacular example of the benefits feature flags bring. The Chrome developers found that the hardest part of keeping a regular release cycle was the time it took to merge long-lived feature branches in.

By being able to turn the new code on and off without recompiling, larger changes could be broken down into smaller merges without impacting existing code. With new features appearing earlier in the same code base, it became more obvious when long-running feature development would impact other parts of the code.

A feature flag isn’t just a command line switch, it’s a way of decoupling feature releases from merging branches, and decoupling feature releases from deploying code. Being able to change your mind at runtime becomes increasingly important when it can take hours, days, or weeks to roll out new software. Ask any SRE: Any system that can wake you up at night is one worth being able to control at runtime.
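As a sketch, assuming a made-up flag store (a JSON file standing in for whatever you can change at runtime), a flag like this decouples turning the feature on from deploying the code. The pricing functions are hypothetical; the point is the branch around them.

    import json

    def flag_enabled(name: str, path: str = "flags.json") -> bool:
        try:
            with open(path) as f:
                return bool(json.load(f).get(name, False))
        except FileNotFoundError:
            return False                      # default to the old behaviour

    def legacy_pricing_total(cart):
        return sum(cart)

    def new_pricing_total(cart):
        return round(sum(cart) * 0.9, 2)      # made-up new behaviour

    def checkout_total(cart):
        if flag_enabled("new_pricing"):
            return new_pricing_total(cart)    # the experiment
        return legacy_pricing_total(cart)     # the code you haven't deleted yet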

It isn’t so much that you’re iterating as that you have a feedback loop. It is not so much that you are building modules to re-use as isolating components for change. Handling change is not just developing new features, but getting rid of old ones too. Writing extensible code is hoping that in three months’ time you got everything right. Writing code you can delete is working on the opposite assumption.

The strategies I’ve talked about — layering, isolation, common interfaces, composition — are not about writing good software, but about how to build software that can change over time.

The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. […] Hence plan to throw one away; you will, anyhow. Fred Brooks

You don’t need to throw it all away but you will need to delete some of it. Good code isn’t about getting it right the first time. Good code is just legacy code that doesn’t get in the way.

Good code is easy to delete.




Acknowledgments

Thank you to all of my proof readers for your time, patience, and effort.

Further Reading

Layering/Decomposition

On the Criteria To Be Used in Decomposing Systems into Modules, D.L. Parnas.

How To Design A Good API and Why it Matters, J. Bloch.

The Little Manual of API Design, J. Blanchette.

Python for Humans, K. Reitz.


Common Interfaces

The Design of the MH Mail System, a Rand technical report.

The Styx Architecture for Distributed Systems, R. Pike, D. M. Ritchie.

Your Server as a Function, M. Eriksen.


Feedback loops/Operations lifecycle

Chrome Release Cycle, A. Laforge.

Why Do Computers Stop and What Can Be Done About It?, J. Gray.

How Complex Systems Fail, R. I. Cook.


The technical is social before it is technical.

All Late Projects Are the Same, Software Engineering: An Idea Whose Time Has Come and Gone?, T. DeMarco.

Epigrams in Programming, A. Perlis.

How Do Committees Invent?, M.E. Conway.

The Tyranny of Structurelessness, J. Freeman.

This is short, and packed with the voice of experience.

I got my talk transcribed, and now it has subtitles in English.