HTTP as Imagined versus HTTP as Found

(source: LiveJournal Engineering, frumper 15 on TR Forums)
On the left is a pristine network diagram. On the right is how most networks look in practice. This isn’t to bemoan the death of cable lacing, but to illustrate what Richard Cook calls ‘Systems as Imagined’ versus ‘Systems as Found’ (stolen from his excellent talk, How Complex Systems Fail).
With any complex system there is always a conflict between intent and implementation, and with protocols there is always a mismatch between specification and reality. If you make a specification too short, implementation-specific behaviour creeps in, and if you make it too long, no-one has a chance of implementing it correctly. A good specification has to strike a balance between the prescriptive and the descriptive, explaining how things should work while begrudgingly admitting how existing implementations behave. HTTP is no exception.
From the outset, HTTP seems like a very simple protocol. You make a request, you get a response back. ‘HTTP as Imagined’ is rather straightforward:
GET / HTTP/1.1<CRLF>
Host: www.example.org<CRLF>
<CRLF>
HTTP/1.1 200 OK<CRLF>
Content-Type: text/plain<CRLF>
Content-Length: 5<CRLF>
<CRLF>
hello
Although RFC 2616 officially defines HTTP, HTTP is also defined by how popular browsers and web servers behave. The RFC is over a decade old, so these behaviours are often discovered by spelunking through source code, or in nightmare debugging sessions. A robust implementation needs to handle the obscure edge cases of the standard, and the mind-boggling ways in which others have implemented HTTP. For example:
- Headers can span multiple lines.
- Line terminators are meant to be CRLF, but code should accept a solitary LF.
- Blank lines can appear before the first line of a request.
- Response start lines may have only a status code, with no reason phrase.
- Deflate and gzip are used interchangeably.
- GET requests can have a body, but not every server knows this.
- Some servers will let you get away with the full URL in the request line.
- You can’t accurately parse an HTTP response without knowing the method used.
- The length of a response body is indicated by a mixture of the response code, the Transfer-Encoding header, Content-Length header, Connection header (and the request method).
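The last two points can be sketched in code. What follows is a minimal illustration, not a complete implementation: the function name `body_framing`, its return values, and the choice to pass headers as a dict with lower-cased names are all assumptions made for this example. The order of the checks follows the message-length rules in the HTTP/1.1 specification.

```python
def body_framing(method, status, headers):
    """Return (kind, length) describing how a response body is delimited.

    headers is assumed to be a dict with lower-cased header names.
    kind is one of "none", "chunked", "length" or "until-close".
    """
    method = method.upper()
    # Some responses never carry a body, whatever their headers claim.
    if method == "HEAD" or 100 <= status < 200 or status in (204, 304):
        return ("none", 0)
    # A successful CONNECT turns the connection into a tunnel.
    if method == "CONNECT" and 200 <= status < 300:
        return ("none", 0)

    codings = [c.strip().lower()
               for c in headers.get("transfer-encoding", "").split(",")
               if c.strip()]
    if codings:
        # Transfer-Encoding takes precedence over Content-Length.
        if codings[-1] == "chunked":
            return ("chunked", None)
        # Any other final coding means the body runs until the close.
        return ("until-close", None)

    if "content-length" in headers:
        value = headers["content-length"].strip()
        if not (value.isascii() and value.isdigit()):
            raise ValueError("invalid Content-Length: %r" % value)
        return ("length", int(value))

    # No framing information: the body ends when the connection does.
    return ("until-close", None)
```

The request method appears in the first two checks, which is why a response cannot be parsed in isolation. The Connection header matters in the "until-close" case, where the connection cannot be reused afterwards.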
A simple request and response may not be so simple in the wild. ‘HTTP as Found’ may require sick-bags.
GET http://www.example.org/ HTTP/1.1<LF>
Host:<LF>
 www.example.org<CRLF>
Content-Length:0<CRLF>
<LF>
HTTP/1.1 100 Continue<CRLF>
<CRLF>
HTTP/1.1 200<CRLF>
Transfer-Encoding: chunked<CRLF>
<CRLF>
5<CRLF>
hello<CRLF>
0<CRLF>
Content-Type: text/plain<CRLF>
<CRLF>
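A lenient parser for the head of a message like the request above might look something like the following. This is a rough sketch under stated assumptions: `parse_head` is a hypothetical name, it only handles the start line and headers, and it assumes a continuation line begins with a space or tab, as the grammar for folded headers requires.

```python
def parse_head(raw: bytes):
    """Leniently split a message head into (start_line, headers).

    Accepts CRLF or a bare LF as the line terminator, skips blank
    lines before the start line, and unfolds header values that
    continue on lines starting with a space or tab.
    """
    lines = raw.replace(b"\r\n", b"\n").split(b"\n")

    # Skip any blank lines before the start line.
    i = 0
    while i < len(lines) and lines[i] == b"":
        i += 1
    if i == len(lines):
        raise ValueError("empty message")
    start_line = lines[i].decode("latin-1")

    headers = []
    for line in lines[i + 1:]:
        if line == b"":
            break  # a blank line ends the head
        text = line.decode("latin-1")
        if text[0] in " \t":
            # Continuation line: append it to the previous header value.
            if not headers:
                raise ValueError("continuation before any header")
            name, value = headers[-1]
            headers[-1] = (name, (value + " " + text.strip()).strip())
        else:
            name, sep, value = text.partition(":")
            if not sep:
                raise ValueError("malformed header line: %r" % text)
            headers.append((name.strip().lower(), value.strip()))
    return start_line, headers
```

Headers are kept as a list of pairs rather than a dict, since the same header name can legitimately appear more than once.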
If this looks bad for HTTP/1.1, at least it isn’t HTTP/0.9. This prehistoric version of HTTP is simpler to parse, in the sense that there are no headers or start line in the response, just the content. Despite being as old as the web, there are situations where a modern robust HTTP server will return such a decrepit response.
GET /a_very_long_url............................ HTTP/1.1<CRLF>
Host: www.example.org<CRLF>
<CRLF>
hello
If you ask for an enormous URL, some servers will only process the first thousand characters of the request, without seeing the HTTP/1.1 at the end. In particular, NGINX will assume it is an HTTP/0.9 request, and then strip the headers from the response. Robust browsers will fail to parse the response as HTTP/1 or above, assume it’s HTTP/0.9, and render the entirety of the response as HTML.
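The client side of that fallback can be approximated with a single check. This is only a sketch of the idea, not how any particular browser does it: `looks_like_http09` is a made-up name, and it assumes the caller has already buffered at least the first five bytes of the response.

```python
def looks_like_http09(first_bytes: bytes) -> bool:
    """Guess whether a response is a headerless HTTP/0.9 body.

    An HTTP/1.x response starts with a status line such as
    "HTTP/1.1 200 OK". Anything else is treated as raw content.
    """
    # Tolerate stray blank lines before the status line.
    return not first_bytes.lstrip(b"\r\n").startswith(b"HTTP/")
```

Applied to the response above, the check sees `hello` rather than a status line, so the whole stream is handed over as content.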
(If this disgusts you, don’t look at character set detection.)
Thankfully, the newest draft of HTTP captures much of the folk knowledge needed to write robust implementations. You may feel dirty, but at least your code works.