[Seventh Heaven] Who Killed Gopher?

Rohit Khare (rohit@elysees.ICS.uci.edu)
Wed, 06 Jan 1999 12:40:37 -0800


Who Killed Gopher?

An Extensible Murder Mystery


By Rohit Khare // December 23, 1998
_________________________________________________________________

As best our forensics team can reconstruct, a serial killer first
surfaced in the beginning of 1991, born of an academic's midnight hack
gone awry. NeXTstep users, you see -- precisely the kind of
object-oriented revolutionaries that would think they could get away
with inventing a brand new protocol for a mature Internet.

The first victim was the campus phonebook: it dismembered the whole
corpus into tiny bits of `hypertext', left hanging together by search
strings alone. Here we see its modus operandi: not content to slice
information into neat lists or trees, it left behind completely
unstructured graphs littered with redirects hither and yon across the
Internet to co-conspirators' servers.

Hegemonic fantasies drove it to assimilate more and more multimedia
formats at each node. Crude at first, shoveling any old file through
an 8-bit clean channel, it grew more explicit until it could label
text, images, audio, even compound documents. By now, it can even
masquerade behind several renderings, negotiating whatever face -- or
language! -- pleases the victim.

As a protocol, it was a blunt weapon: aim a socket at the target, send
a one-line demand, and slurp down the response until the connection
died. The crazy way it grew from there is proof itself that the killer
dropped out of Application Layer Protocol design school at an early
age: it shows no understanding of the basics of TCP, botching
handshakes, slow-start, server-sends-first-byte, even tripping over
the Nagle timer. Instead, in a stroke of streetwise punk genius, it
stayed 'simple' enough that any two-bit Telnet client could take a
hit.

And what an addictive product it was! Workgroupies were seduced by the
ease of installing new servers and authoring new content -- and
pushing free software helped the lock-in cycle along. The killer
preyed on individual users' self-esteem by promising instant fame*, in
the form of a backlink from the master list of servers.

*Anyone have firm attribution for the famous quote, "On the
Internet, everyone will be famous for 15 people"?

Soon, it was assimilating more than bags of bits: it was a gateway
drug to harder services. Scripts, database lookups, even
long-established news, file, and email access transactions were
reduced to mere documents in a menu. The address sent in the 'demand'
grew into a miniature RPC.

All the while, with every new user, every new server, every nugget of
information it wrested from 'legacy' protocols, the killer bled off a
little more of the Internet's bandwidth -- getting away with murder
with every handshake!

Such profligacy is no idle threat: for a while, the killer was the
largest share of application packet traffic on the national NSFNET
backbone.

Or was that the victim?

Did the killer come knocking on port 70 or port 80? From
boombox.micro.umn.edu or info.cern.ch? Bearing the imprimatur of
campus administrators or physicists? Are we hunting gophers or
spiders?

Dial 'H' For Murder

In fact, the profile in Table 1 fits both suspects to a 'T'. In the
early '90s, an interesting drama played out between two contenders for
the first integrated, interactive internet information retrieval
protocols -- neither of which, arguably, had any technical reason to
exist.

That's not to say the systems weren't inevitable: everyone suspected
there'd have to be an easier way to use the Internet in the '90s than
UNIX shell experience. That encompasses innovations in user interface,
authoring formats, and above all, resource identification. It does
not, however, motivate the original editions of the Gopher and
HyperText Transfer Protocols. They were painfully trivial hacks that
did nothing FTP didn't already handle.

So here begins our mysterious tale: why either protocol ever rose to
prominence in the first place; and how the fratricidal drama
eventually played out. Understanding how HTTP killed gopher may lead
us to the forces which may, in turn, topple HTTP.

Science proves a blind alley, though. Their fates were decided not on
technical merits, but on economic and psychological advantages. The
'Postellian' school of protocol design focused on engineering 'right'
solutions for core applications (batch file transfer, interactive
terminals, mail and news relays) anchored in unique transport layer
adaptations (slow-start, Nagle timers, and routing as respective
examples). Our two specimens are 'post-Postel', in their details and
in their adoption dynamics. They are stateless; they lack (Gopher) or
dilute (HTTP) the theory of reply codes; they scale
poorly, imperiling the health of the Internet; and they are 'luxuries'
for publishing discretionary information, not Host Requirements which
must be compiled into every node.

Burrowing into Gopher

While the Web was arguably invented a few months before Gopher, the
latter was the first to garner public notice. By the fall of 1994, RFC
1689's census of information retrieval tools estimated there were
4,800 Gophers, and 1,200 anonymous FTP archives, but only 600 Webs.
Early and "explosive" adoption had a lot to do with Gopher's "fiercely
simple" design, and hence its wide availabilty on a number of
students' (DOS, Windows, Mac, NeXT) and campus IS departments'
(UNIXes, VMS, VM/CMS, MVS) platforms.

It was designed by the microcomputer support folks at the University
of Minnesota for a teletext-like Campus-Wide Information System
(CWIS). The original prototype was a multicolumn tree-browser
pioneered for the NeXTstep file browser. Each scrolling column had a
list of titles; clicking on an entry would either display that file in
the pane below, or load the child directory listing in the column to
the right.

Naturally, the messages on the wire follow directly: open a connection
to the target server on port 70; send a one-line 'selector' string;
get back either the file itself, or a list of further selector lines,
each prefixed by a character indicating the type of file it points to.
Table 2 lists the rudimentary type system Gopher shipped with,
illustrating on one hand its ambitions to assimilate new media types
and on the other what a thin skein it was around protocol- and
platform-specific data formats.
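
To make the wire format concrete, here is a minimal sketch of one Gopher
transaction in Python. The host name is hypothetical and all error
handling is omitted; it is an illustration, not a real client:

    import socket

    HOST, PORT = "gopher.example.edu", 70     # hypothetical server
    SELECTOR = ""                             # the empty selector asks for the root menu

    with socket.create_connection((HOST, PORT)) as sock:
        sock.sendall(SELECTOR.encode("ascii") + b"\r\n")   # the one-line 'demand'
        data = b""
        while chunk := sock.recv(4096):                    # read until the server closes
            data += chunk

    # A menu is tab-delimited US-ASCII: a type character, then title, selector, host, port.
    for line in data.decode("ascii", "replace").splitlines():
        if line == ".":                                    # end-of-menu marker
            break
        item_type, fields = line[0], line[1:].split("\t")
        title, pointer = fields[0], fields[1:]             # [selector, host, port]
        print(item_type, title, "->", pointer)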

The flip side of 'thin skein' is 'ease of implementation', though. In
effect, Gopher was a (very!) poor man's version of FTP, sans
authentication, navigation, authoring, or bulk/restartable transfers.
It only used TCP as a half-duplex fopen() call. It just happened that
some of the "files" it transferred were menus, in a Gopher-specific,
tab-delimited, US-ASCII format. So a Gopher server did little more
than read files out of a specified directory tree; and clients could
be hacked out of spare Telnet client code.

A little further up the evolutionary tree, servers allowed scripts and
other programs to be served as well. Rather than copying a file back
across the socket, the server connected it to some process's output.
Similarly, multiprotocol clients could recognize a non-Gopher selector
and switch to another language (as for phonebook queries). "Gateways
exist to seamlessly access a variety of non-Gopher services such as
ftp, WAIS, USENET news, Archie, Z39.50 (1992 rev), X.500 directories,
Sybase and Oracle SQL servers, etc."

The base protocol didn't leverage TCP well, though. For example, TCP's
three-way handshake already includes a packet from the server to the
client -- so even when the client opens the connection, the server can send
data first for free. Gopher did not have a server-hello message, and
thus no version number to key upgrades off of. To signal the end of a
particular transmission, Gopher borrowed the SMTP/NNTP convention of
<CR><LF>.<CR><LF>. This only worked for its own menu format, though --
as soon as an actual file was transferred, TCP FIN had to be used as
the EOF.
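
The mismatch in end-of-transmission conventions is easy to see in code.
A sketch, again with hypothetical names: Gopher's own menus can be cut
off at the dot-terminator, but any other payload lasts until the server
hangs up.

    import socket

    def gopher_fetch(host, selector, is_menu, port=70):
        # Illustrative only: menus end with a lone '.' line, but for any other
        # file the only end-of-file signal is the closed connection (TCP FIN).
        with socket.create_connection((host, port)) as sock:
            sock.sendall(selector.encode("ascii") + b"\r\n")
            buf = b""
            while chunk := sock.recv(4096):
                buf += chunk
                if is_menu and buf.endswith(b"\r\n.\r\n"):
                    return buf[:-5]        # strip the terminator; we can stop early
            return buf                     # FIN was the only EOF we got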

This left lots of socket buffers in the FIN_WAIT state at the server
side, just like the Web. At the same time, the Gopher Team was
planning to scale to Internet-wide use. They did plan for redundant
backup servers: a '+' line was an alternate for the selector on the
previous line. They maintained a master list of publicly accessible
servers, which was in turn indexed regularly by Veronica (Very Easy
Rodent-Oriented Net-wide Index to Computerized Archives, by analogy to
the Archie FTP server index).

In the long-run, though, no one ever plastered Gopher selectors on
shirts and lunch trucks and golf tees...

Spinning the Web

"The key insight I credit Tim Berners-Lee with, is the URL: the
idea that there's a Uniform Resource Locator that says I can point
at any bit of information on the Internet." -- some obscure Web
scholar in Stephen Segaller's new book Nerds 2.01: A Brief History
of the Internet

If anything, Tim's original Web manifesto was even more ambitious than
Gopher's. Its hook for assimilating "the entire universe of
network-accessible information" was the URI, which is able to subsume
entire namespaces as merely one more scheme (telnet:, mail:, and so
on). URIs reduced operational instructions on how to access a resource
to a declarative address. For example, contrast the one-line ftp: URL
with the MIME message/external-body part for FTP access -- logins,
passwords, alternate sites, alternate directories, alternate formats.
As RFC 1630 intended, "URIs may, if necessary, be passed using pen and
ink."

But while URIs were the innovative essence of the Web, the other two
legs of the triad were less steady. HTML is merely a markup language
which makes it natural to embed URIs as hyperlinks. URIs could just as
well have been popularized as a pointer format within Rich Text
Format, PostScript, LaTeX, or any other language. HTTP is merely a
lookup protocol which makes it natural to resolve http: URLs or
gateway to other schemes. URIs could just as well have been
popularized as a naming interface to existing clients for FTP, Gopher,
News, etc.

Somehow, though, HTTP/0.9 took root. It is a famously 'simple'
protocol, arguably even less useful than Gopher. Here is its entire
grammar:

Simple-Request = "GET" SP Request-URI CRLF
Simple-Response = [ Entity-Body ]
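
On the wire, that grammar amounts to just this -- a sketch in Python,
with a hypothetical host and path:

    import socket

    # HTTP/0.9 in its entirety: one method, no headers, no status line, and a
    # fresh TCP connection torn down after every response.
    with socket.create_connection(("www.example.org", 80)) as sock:
        sock.sendall(b"GET /index.html\r\n")    # Simple-Request
        body = b""
        while chunk := sock.recv(4096):         # Simple-Response is the bare entity body,
            body += chunk                       # delimited only by the server closing the socket
    print(body.decode("latin-1"))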

That's right: it only supported one method, GET, and required a new
TCP connection for every transaction. Even the 1.0 edition, which
eventually added authentication and navigation and simultaneous
transfers of several files for a single "page", had no excuse to be
born when FTP was on the shelf. Except that HTTP had broken
authentication, nonstandard conventions for navigating directories and
welcome/index/home pages, and ran spectacularly afoul of slow-start
for each of its typically-small file transfers.

If HTTP/0.9 was really shipped for no better reason than that
programming separate FTP-Control and FTP-Data channels seemed too
hard, why did it survive? Many of the earliest "web" sites were, in
fact, just ftp: URLs. The saving grace was 1.0's new metadata headers.
It augmented the one-line request with additional fields for
authentication, arguments modifying the method, and information about
the user-agent's preferred languages and formats. The entity bodies it
returned were prefixed by MIME Content-Types, last-modified dates,
base-URI, and more. This information, finally, was what took HTTP
beyond the realm of file transfer to hypertext transfer.
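
A sketch of what those headers added to the conversation (hypothetical
host and path, and only a handful of the possible fields shown):

    import socket

    request = (
        b"GET /index.html HTTP/1.0\r\n"
        b"User-Agent: toy-client/0.1\r\n"          # who is asking
        b"Accept: text/html, image/gif\r\n"        # preferred formats
        b"Accept-Language: en, fr\r\n"             # preferred languages
        b"\r\n"
    )
    with socket.create_connection(("www.example.org", 80)) as sock:
        sock.sendall(request)
        reply = b""
        while chunk := sock.recv(4096):
            reply += chunk

    # The response leads with a status line and MIME-style metadata --
    # Content-Type, Last-Modified, and so on -- before the entity body.
    headers, _, body = reply.partition(b"\r\n\r\n")
    print(headers.decode("latin-1"))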

The headers are how HTTP's adoption cycle diverged from Gopher's. Both
servers were simple to install and author content for -- there was
even a hybrid gn server which could vend files over both port 70 and
80. There were kudos for new sites, in the form of the Web Virtual
Library project at CERN, and later W3C, and early search engines like
Brian Pinkerton's NeXTstep DigitalLibrarian-derived WebCrawler. The
client was the defining wedge. Graphical web browsers leveraged rich
HTML text with MIME media type headers to display graphical images,
launch helper applications, and present multilingual content.

Extensibility: the fingerprint of a killer

The straw that broke the Gopher's back wasn't even a feature of HTTP
per se. The synergy of HTML and HTTP in the graphical browser began
with FORMs, through refresh META tags, up through scripting and applet
OBJECTs. While one protocol stayed lean, the other opened up so much
headroom that it grew into a beast so complex even today's base 1.1
specification takes 160+ pages.

An avalanche of new headers modified GET to only retrieve the full
entity if-modified-since an existing copy, or only a specified
byte-range; enhanced security with keyed-digest passwords; even
reverse-engineered state back into HTTP using 'cookies.' HTTP reply
codes were added for payments, for cache validation, for upgrading to
new protocols.
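
Two of those refinements, sketched with Python's standard http.client
library (the host, path, and dates are hypothetical):

    import http.client

    # Only send the full entity if it has changed since our cached copy...
    conn = http.client.HTTPConnection("www.example.org")
    conn.request("GET", "/index.html",
                 headers={"If-Modified-Since": "Sat, 19 Dec 1998 08:00:00 GMT"})
    resp = conn.getresponse()
    print(resp.status)     # 304 Not Modified means the cache is still good
    resp.read()
    conn.close()

    # ...or only send the bytes we are missing.
    conn = http.client.HTTPConnection("www.example.org")
    conn.request("GET", "/big-image.gif", headers={"Range": "bytes=5000-"})
    resp = conn.getresponse()
    print(resp.status)     # 206 Partial Content carries just the requested byte-range
    resp.read()
    conn.close()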

Entirely new functionality could be grafted on HTTP using new methods.
WebDAV added LOCK and PUT; collections which allowed MOVE and COPY on
entire swaths of URLspace; and even went beyond new headers to encode
additional parameters in XML. Conversely, HTTP was borrowed wholesale
for new protocols for printing (IPP), multiparty call setup (SIP), and
event notification (RVP).
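
As a sketch of what such a grafted-on method looks like in flight --
WebDAV's MOVE, with a hypothetical server and paths, and no pretense of
being a complete client:

    import http.client

    # MOVE relocates a resource -- or a whole collection -- in one request,
    # naming the target with a Destination header.
    conn = http.client.HTTPConnection("dav.example.org")
    conn.request("MOVE", "/drafts/column.html",
                 headers={"Destination": "http://dav.example.org/published/column.html"})
    resp = conn.getresponse()
    print(resp.status, resp.reason)    # 201 Created on success; a server that has never
    resp.read()                        # heard of MOVE simply refuses the unknown method
    conn.close()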

Politically, any server or client could inaugurate a new method;
third-parties could even thread themselves into the proxy chain and
filter messages in transit. Proxies exist to add public annotations
(Crit-Link Mediator), anonymize access (Lucent Personal Web
Assistant), convert file formats (ProxiNet for PalmIIIs), translate
languages (AltaVista's BabelFish can be so hacked), and filter content
(PICS). Decentralized extensibility was key to HTTP's
adoption. Rather than waiting for the IETF standards process to revise
a canonical table of one-letter content codes, for example, MIME
headers could deploy new content-types on the spot.

At the same time, uncontrolled divergence reduces the value of the Web
platform, as the odds that independently developed extensions will
match up decline. New versions of HTTP would appear to be one solution, but
there is little political will to empanel a standing working group to
review a grab-bag of extensions for 1.2, 1.3, ... 39.7 and so on. What
is necessary is some hook for identifying extensions, the level of
support required, and which parts of URLspace are affected.

W3C has been promoting such mechanisms for years -- three calendar
years! -- to better "Lead the [Controlled] Evolution of the World Wide
Web." At first, the motivating scenarios were complex economic and
social negotiations (payments, content selection, privacy,
cryptographic algorithms, &c). Early revisions of PEP advertised
preference lists of alternative extensions, applied in pipelined
sequence. PEP could also transport metadata about which extensions
applied to other resources, and in what order they were required.

Though PEP remained a W3C experiment, the basic ideas of 1) packaging
an entire extension -- methods, headers, encodings and all -- behind a
single name and 2) indicating clearly if an extension was required or
optional for 3) the next hop or end-to-end were reincarnated as the
Mandatory proposal. The current draft defines four header fields
enumerating the URIs of the extensions which apply (Man: and Opt: for
client and server; C-Man: and C-Opt: for the next hop). To
interoperate with the installed base, any request with mandatory
extensions prefixes the method name with 'M-', which forces 1.0 and
1.1 servers to fail with "501 Not Implemented."
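
A sketch of such a request, following the draft's conventions but with
a made-up extension URI and header:

    import http.client

    # The extension is named by URI in a Man: header (the ns= prefix scopes its
    # headers), and the method gets an 'M-' prefix so that servers which have
    # never heard of the framework fail loudly instead of silently ignoring it.
    conn = http.client.HTTPConnection("www.example.org")
    conn.request("M-GET", "/index.html",
                 headers={"Man": '"http://www.example.org/exts/annotation"; ns=16',
                          "16-Annotate": "on"})
    resp = conn.getresponse()
    print(resp.status, resp.reason)    # an oblivious 1.0 or 1.1 server answers
    resp.read()                        # 501 Not Implemented
    conn.close()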

Transport: the Achilles' heel

By @@verify@@April 1994, Web usage passed Gopher's share as well as
every other Internet application protocol as a fraction of NSFnet
backbone usage. Even patching its bandwidth leaks with 1.1's
persistent connections can't mask HTTP's profligate waste of Internet
resources. Its content base has grown so immense that caching alone
can't reduce bandwidth demand by a significant margin for the average
user or ISP. Former strengths have now become liabilities: full
'English' headers, stateless transactions, and full enumeration of
client capabilities -- once critical for debugging -- now inflate
request packets, while progressive rendering of composite web pages
places a premium on simultaneous downloads (aggravating the share of
packets wasted on handshakes and delayed by slow-start).

Individually, these phenomena have appeared in application protocols
before -- and were nipped in the bud by active involvement by
transport protocol folks, or coevolved with TCP. When Telnet generated
a packet per keystroke, even on links where you could type faster than
the ACKs returned, the Nagle timer throttled TCP down to batching
keystrokes into a single packet per round-trip. When file transfers
threatened to swamp LANs the instant they opened, slow-start forced
senders to ramp up gradually to limit congestion.

So now that HTTP presents its own demand -- many small transfers, in
parallel, right away -- there are solutions on the shelf. Some vendors are
already marketing "accelerators" which rely on matched encoders at
server and client, like Sitara's SpeedServer. Unfortunately, these
approaches aren't backward compatible with HTTP as-is -- nor as simple
to code.

W3C's HTTP-NG (Next Generation) project sees it as a three-layer
problem: message transport, remote invocation, and "The Classic Web
Application" (TCWA). At the bottom, they propose a multiplexing
protocol which combines all the traffic destined for a single web
server into a single TCP connection. This allows the conversations
regarding individual transfers to be diced up independently: an
intelligent server might respond with the first few critical bytes of
all the images in the first packet. A fully specified W3Mux layer
still needs to address flow control (gating the relative priority of
the subchannels without starving them), simultaneous open (which TCP
does allow), and lots of interoperability testing.
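
The framing details were still in flux, but the core multiplexing idea
can be sketched in a few lines. This is an illustration of the concept
only, not the W3Mux wire format:

    import struct

    def frame(channel_id, payload):
        # Tag each fragment with a channel id and length so that many transfers
        # can interleave on one TCP connection.
        return struct.pack("!HI", channel_id, len(payload)) + payload

    def deframe(stream):
        offset = 0
        while offset < len(stream):
            channel_id, length = struct.unpack_from("!HI", stream, offset)
            offset += 6
            yield channel_id, stream[offset:offset + length]
            offset += length

    # An 'intelligent server' could put the first bytes of every image up front:
    wire = frame(1, b"GIF89a...") + frame(2, b"GIF89a...") + frame(1, b"rest of image 1")
    for cid, fragment in deframe(wire):
        print(cid, fragment)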

The middle layer 'compresses' the data to be transferred by
marshalling arguments in native data types rather than English. So
numbers would be sent as integers; dates as binary fields; and strings
as reusable tokens. The tip of their proposal (reaching into the
cloudy heights of improbability, some critics might say) is
reengineering the Web as a distributed object system, modeled as
operations on data structures rather than documents.
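
The savings are easy to illustrate. A toy comparison, not the HTTP-NG
encoding itself:

    import struct

    # The same timestamp as an English header line versus a native binary field
    # (here, a 32-bit count of seconds since 1970).
    text_header  = b"Last-Modified: Wed, 23 Dec 1998 12:40:37 GMT\r\n"
    binary_stamp = struct.pack("!I", 914416837)   # about the same instant, in seconds

    print(len(text_header), "bytes as prose,", len(binary_stamp), "bytes as a binary date")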

Building the Perfect Beast

Having slain Gopher (and a host of other competitors), the Web appears
to be expanding in all directions today. Just as the SMTP state
machine defined a style of protocol design for a generation, every new
application these days seems to ape HTTP's stateless style -- if only
to piggyback a free ride through firewalls at port 80. The Internet
Architecture Board is on the verge of issuing guidelines to rein in
this tendency. First, the HTTP port should be reserved for protocols
which manipulate the kind of information already in web servers.
WebDAV is OK, since it extends documents; IPP would not be, since it uses
POST as a barn door for its printing procedure calls. Second, HTTP
should not be referenced by-copy. Lots of other efforts want to
'borrow' a few headers for authentication or navigation and end up
copying text into other specs, raising the specter of divergent
versions. Third, HTTP is not a valid precedent for MIME users -- it
plays fast and loose with certain rules which enable absolute
interoperability through mail gateways: HTTP allows bare <CR> and <LF>,
and it invented a new Content-Encoding: process of its own.

The IAB guidelines are a reminder that HTTP still occupies a limited
niche in the space of possible Transfer Protocols. True, its message
format can encode any MIME entity, even live data streams -- but only
from the server to the client. True, its address format can point at
anything -- but even then it can't encode whether the resource is
expected to be secured, paid for, or otherwise reprocessed. And it is
most certainly restricted to one-to-one, synchronous, request-response
(client-pull) interaction. All the various hacks to 'push' data to
groups of subscribers are, well, hacks: automatic polling, shared
caches, and "we will mail your response within two working days".

The moral, then, for would-be successors to HTTP is to pick another
niche and follow the network effects. Internet-scale event
notification, for example, requires true push and subscriber
management. Current efforts to shove presence information through HTTP
may succeed, but any larger shift to event-based integration could
catalyze a new, symmetric protocol which allows servers to initiate
connections back to clients. Simply compressing existing HTTP traffic
without adding new functionality faces a powerful entrenched
competitor. Adding programmability, whether betting on HTTP-NG or the
Mandatory scheme, is an attempt to harness network effects between all
the communities that have been angling to extend HTTP to their
application. Yet, the rewards for cooperation are spread thin: only
the server which simultaneously supports HTTP, WebDAV, and Internet
Printing Protocol would gain from a common syntax for those
extensions.

And now we are left back where we started: the mystery of protocol
adoption. Reuse, extensibility, and simplicity are all tricks to reduce the
cost of implementing a new protocol -- but that doesn't mean the
market as a whole will make the most efficient decision. If you're a
Gopher, it may just bury you :-)

The morbid conceit of this article was inspired by the first-rate book
Aramis by the French historian of technology Bruno Latour. It told the tale of two
decades of R&D on a Parisian peoplemover system in the first person --
the train itself speaking of its struggle to become, and its death.
_________________________________________________________________

Table 1. Fingerprints of a murderer -- Spider or Gopher?

Vitals

Academic NeXTstep hack, about 8 years old; birthmark at port 70 or 80

First Victim

Notorious primary use was for mere phonebooks; gatewayed queries to an
existing campus database and represented results as hypertext nodes

Disguise

Began without any meaningful type system at all, but soon grew past
file extensions to vend multimedia information, even negotiating
formats

Modus Operandi

Open a new connection, send a one-line demand for the information at a
given address, then receive the response until connection dies

Come-on

Infiltrate small workgroups with small, easy, and free tools,
promising acceptance with a simple `backlink' from a master list of
servers

Serial Victims

Reaching beyond the degenerate case of static information, its
addresses became entry points to scripts, databases, and other
`stateless' servers

Motive

Even its first manifesto laid out hegemonic plans: to incorporate, by
reference or by proxy, every other Internet information service.
_________________________________________________________________

Table 2. Gopher's rudimentary type system used character codes to prefix
selector lines. Gopher+ later added content negotiation.

0 File
1 Directory
2 CSO phone-book server
3 Error
4 BinHexed Mac file
5 DOS binary file
6 Uuencoded Unix file
7 Index-Search server
8 Text-based telnet session
9 Binary file
+ Redundant server
T Text-based tn3270 session
g GIF image file
I Any image file
_________________________________________________________________

Table 3. RFCs and Internet-Drafts discussed in this issue.

@@need to add draft-iab-using-http-00.txt to the end.

RFC 1436 (Mar 1993) -- The Internet Gopher Protocol
    Described a `fiercely simple' request-response protocol and menu format

RFC 1630 (June 1994) -- Universal Resource Identifiers
    Raised the stakes from Gopher's own selectors to any namespace

RFC 1689 (Aug 1994) -- Status Report on Networked Information Retrieval: Tools and Groups
    A wonderful archaeological census of competing protocols, many now extinct

RFC 1945 (May 1996) -- HyperText Transfer Protocol 1.0
    Also defined the more primitive 0.9 GET. Even 1.0 only added HEAD and POST

RFC 2068 (Jan 1997) -- HyperText Transfer Protocol 1.1
    A vastly expanded contender with better caching, security, and extensibility

RFC 2227, Proposed Standard (Oct 1997) -- Simple Hit-Metering and Usage-Limiting for HTTP
    The entire RFC defined a single Meter: header

RFC 2396, Draft Standard (Aug 1998) -- Uniform Resource Identifiers: Generic Syntax
    The deceptively small definitive grammar took years to hammer out

RFC TBA, Proposed Standard (Dec 1998) -- Extensions to HTTP for Distributed Authoring
    Adding locks, collection-resources, and namespace operations took major reengineering

Internet-Draft (Nov 1997) -- PEP -- an Extension Mechanism for HTTP
    `Protocol Extension Protocol' had negotiation policies for pipelining message processors

Internet-Draft (July 1998) -- HTTP State Management Mechanism
    Cookies took three years from a quick, risky hack to a reasonably secure mechanism

Internet-Draft (Sep 1998) -- HTTP Authentication: Basic and Digest
    A companion specification for nonexistent (base64-encoded passwords!) and new security

Internet-Draft (Nov 1998) -- HTTP Extension Framework
    Pared down to headers which label `Mandatory' and optional extensions of each message

Internet-Draft (Nov 1998) -- Upgrading to TLS Within HTTP/1.1
    Using the Upgrade: header for Transport Layer Security, in lieu of a dedicated port like https: at 443