programming is terrible
lessons learned from a life wasted

HTTP as Imagined versus HTTP as Found

A clean diagram of a network setup, juxtaposed with a mess of network cabling

(source: LiveJournal Engineering, frumper 15 on TR Forums)

On the left is a pristine network diagram. On the right is how most networks look in practice. This isn’t to bemoan the death of cable lacing, but an illustration of what Richard Cook calls ‘Systems as Imagined’ versus ‘Systems as Found’ (stolen from his excellent talk, How Complex Systems Fail).

With any complex system there is always a conflict between intent and implementation, and with protocols there is always a mismatch between specification and reality. Make a specification too short and implementation-specific behaviour creeps in; make it too long and no-one has a chance of implementing it correctly. A good specification has to strike a balance between the prescriptive and the descriptive: explaining how things should work, while begrudgingly admitting how existing implementations actually behave. HTTP is no exception.

From the outset, HTTP seems like a very simple protocol. You make a request, you get a response back. ‘HTTP as Imagined’ is rather straightforward:

GET / HTTP/1.1<CRLF>
Host: www.example.org<CRLF>
<CRLF>


HTTP/1.1 200 OK<CRLF>
Content-Type: text/plain<CRLF>
Content-Length: 5<CRLF>
<CRLF>
hello
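Parsing the imagined exchange is equally straightforward. Here is a minimal sketch (illustrative Python, not production code): split on the blank line, read the status line and headers, and take exactly Content-Length bytes of body.

```python
def parse_simple_response(raw: bytes):
    """Parse a well-formed 'HTTP as Imagined' response: a status line,
    headers, a blank line, then exactly Content-Length bytes of body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.split(b"\r\n")
    version, status, reason = status_line.split(b" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers[b"content-length"])
    return int(status), headers, body[:length]

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/plain\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")
```

This is the parser everyone writes first, and it falls over the moment it meets the real world below.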

Although RFC 2616 officially defines HTTP, HTTP is also defined by how popular browsers and web servers behave. The RFC is over a decade old, so these behaviours are often discovered through spelunking source code, or nightmare debugging sessions. A robust implementation needs to handle both the obscure edge cases of the standard and the mind-boggling ways in which others have implemented HTTP. For example:

A simple request and response may not be so simple in the wild. ‘HTTP as Found’ may require sick-bags.


GET http://www.example.org/ HTTP/1.1<LF>
Host:<LF>
 www.example.org<CRLF>
Content-Length:0<CRLF>
<LF>


HTTP/1.1 100 Continue<CRLF>
<CRLF>
HTTP/1.1 200<CRLF>
Transfer-Encoding: chunked<CRLF>
<CRLF>
5<CRLF>
hello<CRLF>
0<CRLF>
Content-Type: text/plain<CRLF>
<CRLF>
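A parser that survives in the wild has to bend on framing. The sketch below (illustrative Python, not any particular implementation) shows two of the concessions the exchange above demands: accepting bare LF line endings with folded continuation lines, and decoding a chunked body whose trailer headers arrive after the data.

```python
def split_header_lines(head: bytes) -> list:
    """Lenient line splitting: accept CRLF or a bare LF as a line
    terminator, and unfold obs-fold continuations (a line starting
    with space or tab belongs to the previous header)."""
    lines = []
    for raw in head.split(b"\n"):
        line = raw.rstrip(b"\r")
        if line[:1] in (b" ", b"\t") and lines:
            lines[-1] += b" " + line.strip()  # unfold onto previous header
        elif line:
            lines.append(line)
    return lines

def decode_chunked(raw: bytes) -> bytes:
    """Decode a Transfer-Encoding: chunked body: a hex size line, that
    many bytes of data, repeat; a zero-size chunk ends the body, and
    trailer headers (like the late Content-Type above) may follow."""
    body = bytearray()
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol].split(b";")[0], 16)  # strip chunk extensions
        pos = eol + 2
        if size == 0:
            break  # trailers, if any, follow the last chunk; ignored here
        body += raw[pos:pos + size]
        pos += size + 2  # skip the CRLF that terminates the chunk data
    return bytes(body)
```

Even this is charitable: a truly battle-hardened decoder also tolerates bare LF inside the chunked framing itself.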

If this looks bad for HTTP/1.1, at least it isn’t HTTP/0.9. This prehistoric version of HTTP is simpler to parse, in the sense that the response has no status line or headers, just the content. Despite being as old as the web, there are situations where a modern, robust HTTP server will return such a decrepit response.

GET /a_very_long_url............................ HTTP/1.1<CRLF>
Host: www.example.org<CRLF>
<CRLF>


hello

If you ask for an enormous URL, some servers will only process the first thousand or so characters of the request, never seeing the HTTP/1.1 at the end. In particular, NGINX will assume it is an HTTP/0.9 request, and strip the headers from the response. Robust browsers will fail to parse the response as HTTP/1 or above, assume it’s HTTP/0.9, and render the entirety of the response as HTML.
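The fallback check a client applies is tiny, which is exactly why it’s dangerous. A sketch of the heuristic (illustrative, not any browser’s actual code): if the bytes don’t begin with ‘HTTP/’, the entire response, headers and all, is treated as an HTTP/0.9 body.

```python
def response_version(response: bytes) -> str:
    """The lenient fallback a browser applies: a response that does not
    begin with 'HTTP/' is assumed to be HTTP/0.9, so everything in it,
    including anything the server meant as headers, becomes body."""
    if response.startswith(b"HTTP/"):
        return response[5:8].decode("ascii")  # e.g. '1.1'
    return "0.9"
```

So a headerless blob from a confused server and a genuine 1991-vintage response are indistinguishable, and both get rendered.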

(If this disgusts you, don’t look at character set detection)

Thankfully, the newest draft of HTTP captures much of the folk knowledge needed to write robust implementations. You may feel dirty, but at least your code works.