programming is terrible
lessons learned from a life wasted

Why Do Computers Stop and What Can Be Done About It?

Earlier this week I was lucky enough to see Joe Armstrong talk about the principles behind Erlang. During the talk he mentioned one of my favourite technical reports—Why do Computers stop? by Jim Gray for Tandem Computers.

http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

Tandem Computers built fault-tolerant, highly available machines before the web existed. The uptime of their machines was legendary, as an apocryphal technical support call demonstrates:

Hi, is this Support? We have a problem with our Tandem: a car bomb exploded outside the bank, and the machine has fallen over. … No, no, it hasn’t crashed, it’s still running, just on its side. We were wondering if we can move it without breaking it.

The report offers some insight as to how these legends were born—isolation, failing fast, transactional updates, process pairs and supervision—as well as some interesting statistics and observations.
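
To make two of those ideas a little more concrete, here is a minimal sketch of failing fast under a supervisor. It is my own illustration, not code from the report or from Tandem's systems: `FailFast`, `make_flaky_worker` and `supervise` are hypothetical names, and the worker keeps a counter across restarts only to simulate a transient fault that eventually clears. In a real system a restart would start a fresh process with fresh state.

```python
class FailFast(Exception):
    """Raised by a worker the moment it detects a fault it cannot handle."""


def make_flaky_worker(failures_before_success):
    """Return a hypothetical worker that fails a fixed number of times."""
    state = {"calls": 0}  # stands in for a transient fault that clears later

    def worker():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            # Fail fast: stop immediately rather than limp on with bad state.
            raise FailFast(f"fault detected on attempt {state['calls']}")
        return f"ok after {state['calls']} attempts"

    return worker


def supervise(worker, max_restarts=3):
    """Run the worker, restarting it after each failure, up to max_restarts."""
    restarts = 0
    while True:
        try:
            return worker()
        except FailFast:
            if restarts == max_restarts:
                raise  # give up and let a higher-level supervisor decide
            restarts += 1


result = supervise(make_flaky_worker(2))
print(result)
```

The worker never tries to repair itself. It just stops, and the decision about what to do next lives in one place, the supervisor, which either restarts it or passes the failure further up.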

For a little more analysis, I’d recommend @mononcqc’s excellent summary, especially for those who want to skip straight to the good bits of the report.