programming is terrible
lessons learned from a life wasted

Framework First, Product Later

Years ago I joined a small company, spun out from another startup, to turn a patent into profit. The patent was about a better way to run a chat service, similar to IRC.

Chat servers have to keep track of clients and the chatrooms they are in, and deliver messages between clients. Normally, when multiple servers are involved, each one must keep track of all the clients, chatrooms and messages too. The patent offered a way to spread the load across the machines, rather than duplicating the work on every one.

To distribute the work, each room was managed by one particular machine, and rooms could be moved from one machine to another. To make these migrations invisible, routers sat between clients and rooms, buffering messages while a room was in transit. This meant that you could add or remove machines without interrupting the conversations.
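Here is a rough sketch of that routing idea, not the patented design itself. Every name in it (Room, Machine, Router, begin_migration, finish_migration) is made up for illustration, and everything runs in one Python process rather than across real servers. The router holds messages for a room that is in transit and replays them once the room has landed on its new machine.

```python
from collections import deque


class Room:
    """A chatroom: the state that has to travel with a migration."""

    def __init__(self, name, members=None, log=None):
        self.name = name
        self.members = set(members or [])
        self.log = list(log or [])

    def deliver(self, sender, text):
        self.log.append((sender, text))

    def snapshot(self):
        # Everything needed to rebuild the room somewhere else.
        return {"name": self.name,
                "members": sorted(self.members),
                "log": list(self.log)}


class Machine:
    """Hypothetical stand-in for a server hosting rooms."""

    def __init__(self, name):
        self.name = name
        self.rooms = {}


class Router:
    """Sits between clients and rooms; buffers while a room is moving."""

    def __init__(self):
        self.location = {}  # room name -> Machine currently hosting it
        self.pending = {}   # room name -> messages held during transit

    def place(self, room, machine):
        machine.rooms[room.name] = room
        self.location[room.name] = machine

    def send(self, room_name, sender, text):
        if room_name in self.pending:
            # Room is in transit: hold the message instead of dropping it.
            self.pending[room_name].append((sender, text))
            return
        self.location[room_name].rooms[room_name].deliver(sender, text)

    def begin_migration(self, room_name):
        self.pending[room_name] = deque()
        source = self.location[room_name]
        return source.rooms.pop(room_name).snapshot()

    def finish_migration(self, snapshot, target):
        room = Room(snapshot["name"], snapshot["members"], snapshot["log"])
        self.place(room, target)
        # Replay buffered messages in arrival order.
        for sender, text in self.pending.pop(room.name):
            room.deliver(sender, text)


router = Router()
old, new = Machine("old"), Machine("new")
router.place(Room("lobby", {"alice", "bob"}), old)
router.send("lobby", "alice", "hi")
state = router.begin_migration("lobby")
router.send("lobby", "bob", "still there?")  # buffered, not lost
router.finish_migration(state, new)
```

After the move, the room on the new machine has both messages in order, and the clients never learn that anything happened. The cost is also visible: every message takes an extra hop through the router.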

Although I’ve talked about chatrooms and clients, the patent was really about migrating processes. The process could work as a chatroom, or it could implement some special logic and keep track of extra information. When the process moved, the extra information was passed along too.

The patent solved a hard problem and promised easy scaling, but it came at a price—the extra routers introduced latency, processes could not communicate with each other, and there was no real way to split one large process across two servers. If your problem didn’t neatly break into isolated pieces, it was a very tough fit. Despite these drawbacks, and because Cloud, Elastic and Scale were fashionable, we went in search of applications to build.

I asked my Erlang-loving friends for help. They thought process migration was a neat idea, but instead of processes with special data, they wrote processes without any. Eliminating state means that if a process dies, you can just restart it. To them, a migration was really killing one process and starting another somewhere else. When I relayed my findings, optimism prevailed:

We will just need to make those things stateful to take advantage of our system

With state it isn’t so easy to be reliable—you need to run multiple copies of the process, so when one copy dies you can continue using the others. The patent covered migrating one process, not multiple ones, so we added backup servers. Before a process moved, the backups would be terminated, and after the move new backups would be created. This didn’t work so well in practice, and for a while turning on backups made the system crash. Eventually I realised that the problem was much harder—migration and reliability were conflicting goals.

Migration implies that the service can only exist in one place. Reliability requires keeping multiple copies in sync. Instead of moving processes from one machine to another, we were really making new copies and terminating old ones. However, the patent covered migration, so no matter what, migration was the solution. We wrote the framework around the patent, and the demo applications were constrained in turn.

We demoed an airplane booking system. By using one process per flight you could elastically scale with demand, with the minor drawback that you couldn’t search across multiple flights. The chat room demo we had couldn’t create new rooms.
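To see why search was the casualty, here is a small sketch of the one-process-per-flight layout. The class, its fields and the flight numbers are invented for illustration and are not taken from the actual demo. Booking only ever touches one process, which is what makes it easy to spread around. A search has to look inside every process at once, which is exactly what isolated processes do not offer.

```python
from dataclasses import dataclass, field


@dataclass
class FlightProcess:
    """Hypothetical isolated process owning the state of a single flight."""
    flight_no: str
    origin: str
    destination: str
    seats: int
    passengers: list = field(default_factory=list)

    def book(self, passenger: str) -> bool:
        # Touches only this flight's state, so it can run on any machine.
        if len(self.passengers) >= self.seats:
            return False
        self.passengers.append(passenger)
        return True


def search(processes, origin, destination):
    """What a cross-flight search would need: a visit to every process.

    This only works here because all the processes live in one dict in
    one Python interpreter. Spread them across machines with no way to
    talk to each other and there is nowhere to run this loop.
    """
    return [p.flight_no for p in processes.values()
            if p.origin == origin
            and p.destination == destination
            and len(p.passengers) < p.seats]


processes = {
    "XX100": FlightProcess("XX100", "LHR", "JFK", seats=1),
    "XX200": FlightProcess("XX200", "LHR", "JFK", seats=2),
    "XX300": FlightProcess("XX300", "LHR", "SFO", seats=2),
}
processes["XX100"].book("alice")      # one process, no coordination
print(search(processes, "LHR", "JFK"))
```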

Without a real application, features proliferated. Instead of making choices, we made options. If you encounter a framework that wasn’t extracted from a product, run away screaming, no matter how extensible it claims to be.

Patented or not, if the developers haven’t made anything useful using their framework, neither will you.