XML and Automated Web Processing.

I Find Karma (adam@milliways.cs.caltech.edu)
Fri, 19 Sep 1997 17:04:40 -0700 (PDT)


Rohit, why didn't you FoRK this?

http://developer.netscape.com/news/viewsource/bray_xml.html

Asides...
1. I'm in so much pain from my dental surgery today. Ow. Ow.
2. Watch the South Park marathon tonight on Comedy Central, 11pm-2am!!!

BEYOND HTML: XML AND AUTOMATED WEB PROCESSING
By Tim Bray

Send comments and questions about this article to View Source.

XML (Extensible Markup Language) was nowhere a year ago; now it seems to
be everywhere. It's supposed to be the thing that "goes beyond HTML" --
but what does that mean? Since HTML is the most successful document
format in history, why would anyone want to go beyond it? The people who
are working on XML talk about "automating the Web" -- what does that
mean? XML is designed to do some jobs that HTML isn't built to handle
but that really need doing. If you just want to display text, there's
nothing wrong with HTML, but for automated Web processing -- enriching
documents in a way that enables computer programs (like Web robots) to
do something with them -- what's needed is XML.

XML was designed under the auspices of the World Wide Web Consortium
(W3C). It went public in November 1996 and is already the basis for
half a dozen proposals to automate Web processing. XML has a lot of
people thinking really hard about what the future of the Web will look
like. You need to start thinking about XML now, because a year from now
you'll undoubtedly be using it a lot.

XML is extensible, easy, and (hard to believe, but true) guaranteed not
to break your computer programs. In this article, I'll expand on these
extravagant claims. Then I'll explain where XML came from and also take
some guesses as to where it's going and what it might mean for you.

XML IS EXTENSIBLE

Extensibility is the reason for XML. HTML is great, but often it can
seem to have either too many tags or not enough. It's got too many if
you're trying to write a browser, a robot, or a general-purpose
JavaScript utility, but not enough if you want to identify a
<Part-Number>, <Exchange-Rate>, or <Aikido-Rank> in your Web page to
allow automated processing. HTML doesn't have those tags, and it isn't
going to get them. But in XML, you can make up any old tags you want to
use.

Suppose I wanted to add some intelligence to the vast amount of e-mail
stored on my computer. With XML I could mark it as shown in Example 1.

-----------------------------------------------------------------------
Example 1
<email>
<head>
<from> <name>Tim Bray</name> <address>tbray@textuality.com</address> </from>
<to> <name>Paul Dreyfus</name> <address>pdreyfus@netscape.com</address> </to>
<subject> First draft of XML intro </subject>
</head>
<body>
<p>Here's a draft of that XML article. I'll be on the road but
connected to e-mail. Let me know if it hits the right level (i.e., are
major revisions in order?). If it's fine, proceed with editorial
nit-pickery. -Tim</p>
<attach encoding="mime" name="xml-draft.html"/>
</body>
</email>
-------------------------------------------------------------------------

This example should be pretty obvious. The <attach> tag looks a little
weird, but we'll cover that in a moment. Some of the advantages should
also be obvious. To start with, a Web robot could do a smart job of
indexing this, and a Java applet could do all sorts of intelligent
formatting (such as build a table-of-contents summary of a bunch of
e-mail). The basic idea here is called descriptive markup: the tags
around a chunk of text don't say how to format it, or what to do when
people click on it; they just say what it is. This is in dramatic
contrast to HTML, where the tags do all these things at once.

The big win with descriptive markup is a bit subtle. Suppose you're
processing some e-mail and you want to be able to display it both with
Navigator on a big monitor and on the teeny screen of a cell phone. If
the e-mail were marked up in XML, you could write one set of rules for
the monitor and another for the cell phone, another to produce a
professional-quality paper printout, and still another to drive a fax
machine.

The idea is that you've decoupled the document from its
presentation. This doesn't make designing good documents or good
presentations easy, but it does mean that you can attack the problems
separately, which is a big step forward.
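
As a rough illustration of that decoupling, here is a minimal sketch in Python (not part of the original article) that reads a trimmed copy of Example 1 with the standard xml.etree.ElementTree module and renders it under two different rule sets. The function names, the line widths, and the choice of which fields each device shows are all assumptions made for the sake of the example.

```python
import textwrap
import xml.etree.ElementTree as ET

# Trimmed copy of Example 1 (attachment omitted to keep the sketch short).
EMAIL = """<email>
<head>
<from> <name>Tim Bray</name> <address>tbray@textuality.com</address> </from>
<to> <name>Paul Dreyfus</name> <address>pdreyfus@netscape.com</address> </to>
<subject> First draft of XML intro </subject>
</head>
<body>
<p>Here's a draft of that XML article. Let me know if it hits the
right level.</p>
</body>
</email>"""


def fields(doc):
    """Pull the sender, subject, and paragraphs out of the markup."""
    root = ET.fromstring(doc)
    sender = root.findtext("head/from/name").strip()
    subject = root.findtext("head/subject").strip()
    paras = [" ".join(p.text.split()) for p in root.iter("p")]
    return sender, subject, paras


def render_monitor(doc, width=72):
    """Hypothetical rules for a big screen: full headers, wide lines."""
    sender, subject, paras = fields(doc)
    lines = ["From: " + sender, "Subject: " + subject, ""]
    for para in paras:
        lines.extend(textwrap.wrap(para, width))
    return "\n".join(lines)


def render_phone(doc, width=20):
    """Hypothetical rules for a tiny screen: subject only, narrow lines."""
    _, subject, paras = fields(doc)
    lines = [subject[:width]]
    for para in paras:
        lines.extend(textwrap.wrap(para, width))
    return "\n".join(lines)


print(render_monitor(EMAIL))
print()
print(render_phone(EMAIL))
```

Neither renderer touches the document itself; each is just a different set of rules applied to the same markup.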

Publish and Constrain Your Tags

Obviously, you don't want to make up a new set of tags every time you
write a document. Furthermore, since this is the Web, you'd probably
like to share your work with others.

XML has something called a document type definition (usually called a
DTD) that allows you to define the tags you've created, for future use
by yourself or others. Example 2 is the DTD for the e-mail shown in the
earlier example.

------------------------------------------------------------------------
Example 2
<!element email (head, body)>
<!element head (from, to+, cc*, subject)>
<!element from (name?, address)>
<!element to (name?, address)>
<!element name (#PCDATA)>
<!element address (#PCDATA)>
<!element subject (#PCDATA)>
<!element body (p | attach)*>
<!element p (#PCDATA)>
<!element attach EMPTY>
<!attlist attach encoding (mime|binhex) "mime"
name CDATA #REQUIRED>
-------------------------------------------------------------------------

This should be easy to read, too. In English, it says:

1. An EMAIL has to have a HEAD and a BODY.
2. The HEAD has to have a FROM, one or more TOs, zero or more CCs, and
a SUBJECT.
3. The FROM and the TO can both include a NAME, and they have to
include an ADDRESS.
4. The NAME, ADDRESS, and SUBJECT are all just text.
5. The BODY is a mixture of Ps and ATTACHes.
6. A P contains just text.
7. An ATTACH doesn't contain anything, but it has an ENCODING
attribute whose value can be either mime or binhex; if it's not there,
the default is mime.
8. An ATTACH also has a NAME attribute whose value can be any text,
but has to be there.

I'm not going to explain all the details of the DTD syntax, but the
ideas are pretty obvious. Clearly, you'd normally have one DTD that
describes a lot of different documents; think of it as an SQL database
schema for documents.
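
To make the schema analogy concrete, here is a minimal sketch in Python (not from the original article) of what checking a document against Example 2 might look like. It assumes each content model has been hand-translated into a regular expression over the names of an element's children; the MODELS table and the validate function are hypothetical names, and the sketch ignores stray text inside elements that should hold only other elements. It is nowhere near a real validating parser.

```python
import re
import xml.etree.ElementTree as ET

# Content models from Example 2, hand-translated into regular expressions
# over the space-terminated sequence of child element names.
# None means "text only" (#PCDATA); "" means EMPTY.
MODELS = {
    "email": r"head body ",
    "head": r"from (to )+(cc )*subject ",
    "from": r"(name )?address ",
    "to": r"(name )?address ",
    "name": None,
    "address": None,
    "subject": None,
    "body": r"((p|attach) )*",
    "p": None,
    "attach": r"",
}


def validate(elem):
    """Return a list of complaints; an empty list means the tree matches."""
    if elem.tag not in MODELS:
        return ["undeclared element <%s>" % elem.tag]
    model = MODELS[elem.tag]
    children = "".join(child.tag + " " for child in elem)
    problems = []
    if model is None:
        if children:
            problems.append("<%s> must contain only text" % elem.tag)
    elif not re.fullmatch(model, children):
        problems.append("<%s> has children [%s]" % (elem.tag, children.strip()))
    if elem.tag == "attach":
        # Attribute rules from the attlist declaration in Example 2.
        if "name" not in elem.attrib:
            problems.append("<attach> needs a name attribute")
        if elem.get("encoding", "mime") not in ("mime", "binhex"):
            problems.append("<attach> encoding must be mime or binhex")
    for child in elem:
        problems.extend(validate(child))
    return problems


GOOD = ET.fromstring(
    "<email><head><from><address>a@example.org</address></from>"
    "<to><name>B</name><address>b@example.org</address></to>"
    "<subject>Hi</subject></head>"
    '<body><p>Text</p><attach name="x.html"/></body></email>'
)
# Same message with the required <to> left out.
BAD = ET.fromstring(
    "<email><head><from><address>a@example.org</address></from>"
    "<subject>Hi</subject></head><body/></email>"
)

print(validate(GOOD))
print(validate(BAD))
```

The point is only that a DTD is machine-readable: a program can decide mechanically whether a given message fits the declared structure.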

If this DTD were stored at some location -- say,
http://home.netscape.com/DTDs/email.dtd -- then to associate the DTD
with the e-mail message you'd insert a first line like this:

<!doctype email SYSTEM "http://home.netscape.com/DTDs/email.dtd">
<email>
<head>
<from> <name>Tim Bray</name> <address>tbray@textuality.com</address> </from>
<to> <name>Paul Dreyfus</name> <address>pdreyfus@netscape.com</address> </to>
...

The DTD might be useful to a program that received one of these
e-mail messages and wanted to find out in advance what tags would be in
it and how they fit together. But its most important use is to support
smart editing programs, which could read the DTD and simply not let the
author create a document that didn't match the DTD. (This isn't
imaginary; such authoring tools already exist.)

An XML document for which there is a DTD, and which conforms to that
DTD, is called valid. But a document doesn't have to be valid to be
useful, as we'll see in a moment.

Extensible Hyperlinks, Too

Adding your own tags is nice, but that's only part of what makes the Web
useful and XML interesting. Hyperlinks make the Web go; the <A
HREF="whatever"> idiom has become universal. XML extends Web
hyperlinks in a couple of useful directions. Example 3 is taken from a
description of a tournament game of Go (which is an old, complex,
popular Asian board game, Sakata being one of the most famous players of
this century).

-------------------------------------------------------------------------
Example 3
<P>Faced with a tight situation, Sakata found a
<X><L ROLE="EG" TITLE="English translation"
SHOW="NEW" HREF="/cgi-bin/xlate?term=tesuji" />
<L ROLE="ToMove" TITLE="Jump to move in game record"
SHOW="REPLACE" HREF="game.html#Move127" />
<L ROLE="PIC" TITLE="Illustration"
SHOW="EMBED"
HREF="pix.xml#DESCENDENT(1,FIG,CAPTION,TESUJI)" />
<L ROLE="CourseNotes" TITLE="Course Notes"
HREF="notes.xml#ID(def-Tesuji)..DITTO,NEXT(3,P)" />
tesuji</X>.</P>
--------------------------------------------------------------------------

Once again, we'll skip the syntactic details, which are explained in the
Linking part of the XML Specification. In a browser, this would look
something like:

Faced with a tight situation, Sakata found a tesuji.

When you clicked on "tesuji," though, instead of the usual Web behavior
of charging off after that link, you'd get a menu with four entries:
English translation, Jump to move in game record, Illustration, and
Course Notes.

Choosing English translation would run an ordinary CGI script. The
attribute SHOW="NEW" means that rather than replacing the current page,
the results of the script would show up in a new window (as if you'd
said TARGET=_NEW in an HTML page). By the way, the translation would
reveal that "tesuji" is a Go term meaning a clever tactical maneuver.
Jump to move in game record, a link into an HTML page, would behave
exactly as the Web does today.

The Illustration option is more interesting. First of all, it's a link
into an XML file. The text after the # in the URL says that the link is
to the first FIG element that has the attribute CAPTION="TESUJI". Also,
because of the attribute SHOW="EMBED", rather than replacing the current
page with the target of the link, that target material would be inserted
in the display right here at the location of the link.

The Course Notes option links to a "span" of text in an XML file --
specifically, the first three paragraphs following a tag that has the
attribute ID="def-Tesuji".

These straightforward extensions of the Web's current linking
facilities, in my opinion, add a lot of richness and cost very
little. (But then, I helped write the spec.)
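
For the curious, here is a minimal sketch in Python (not from the original article) of how a program might turn the markup in Example 3 into the four-entry menu described above. It only reads the TITLE, SHOW, and HREF attributes with the standard xml.etree.ElementTree module; it does not try to interpret the part of each HREF after the #, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Example 3, copied as-is.
EXAMPLE_3 = """<P>Faced with a tight situation, Sakata found a
<X><L ROLE="EG" TITLE="English translation"
SHOW="NEW" HREF="/cgi-bin/xlate?term=tesuji" />
<L ROLE="ToMove" TITLE="Jump to move in game record"
SHOW="REPLACE" HREF="game.html#Move127" />
<L ROLE="PIC" TITLE="Illustration"
SHOW="EMBED"
HREF="pix.xml#DESCENDENT(1,FIG,CAPTION,TESUJI)" />
<L ROLE="CourseNotes" TITLE="Course Notes"
HREF="notes.xml#ID(def-Tesuji)..DITTO,NEXT(3,P)" />
tesuji</X>.</P>"""


def visible_text(xml_text):
    """The text a reader would see, with the link elements stripped out."""
    root = ET.fromstring(xml_text)
    return " ".join("".join(root.itertext()).split())


def link_menu(xml_text):
    """One (title, show, href) entry per <L> element, in document order.

    show is None where the markup gives no SHOW attribute.
    """
    root = ET.fromstring(xml_text)
    return [
        (link.get("TITLE"), link.get("SHOW"), link.get("HREF"))
        for link in root.iter("L")
    ]


print(visible_text(EXAMPLE_3))
for title, show, href in link_menu(EXAMPLE_3):
    print("%-28s %-8s %s" % (title, show, href))
```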

XML IS EASY

Most standards, even popular ones, never get read by most people. How
many of us, for example, have actually read the basic HTML or TCP/IP
specs, or even the electrical standards that allow you to plug a toaster
in safely? The XML Specification, on the other hand, is short (less
than 40 pages), and since it was designed for use by programmers, most
readers of this article will find it straightforward.

The XML spec is available not only in HTML but also in RTF, PostScript,
and PDF versions. This was easy to arrange because the spec is actually
written in XML; all the other versions were auto-generated with a
variety of formatting systems. (Remember our discussion above about the
advantages of decoupling markup from a particular formatting semantic?)

For programmers, the HTML version is probably the most helpful. All the
special terms are linked to their definitions, and all the
"nonterminals" on the right-hand side of grammar productions are linked
to their definitions. If you want to sit down and read the spec
end-to-end, paper is the way to go.

XML had a design goal that it should be easy enough for a smart
programmer to whip up a parser in a week. Since it was announced in
November 1996, a ton of parsers have been whipped up. The one I wrote,
named Lark, took a bit more than a week, but I was traveling when I
wrote it. Besides, I didn't know Java when I started and had to learn it
as I went along. The Java class files for Lark are only about 40K, and
it does most of XML, with good error messages. This is simple stuff.

Elements, Tags, and Attributes

An XML document is made up of elements. Most elements have a start-tag,
which may contain attributes, and an end-tag. Example 4 illustrates the
XML terms element, element type, content, start-tag, end-tag, attribute
name, attribute value, and empty element.

-------------------------------------------------------------------------
Example 4
<p secret="false">This sentence is in the content of an
element whose type is "p"; the content is found between the
start-tag and the end-tag. The paragraph has an attribute named "secret"
whose value is "false". <IMG SRC='madonna.jpg'/> is an
empty element, distinguished by the fact that it ends with "/>".</p>
-------------------------------------------------------------------------

That's about all there is to it. The only thing that will look a little
weird to Web-folk is the <IMG> tag ending in />. (In HTML the <IMG> tag
just ends with > like any other tag.)

The /> is important. Since any HTML processor "just knows" that <IMG> is
an empty tag (one that doesn't depend on enclosing text and so doesn't
have an end tag), no special syntax is required. But since in XML you
can invent your own tags, empty elements (having no end-tags) need
special syntax to keep parsers from getting confused. The /> trick
allows simple programs to parse documents without knowing anything
about them in advance.
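
To see why the /> matters, here is a minimal sketch in Python (not from the original article) of a tag classifier that has no list of element names at all. It sorts tags into start, end, and empty purely by their syntax. The regular expression is a simplification made for this example: it assumes no > character appears inside an attribute value and ignores comments and declarations.

```python
import re

# "<", optional "/", a name, anything up to the close, optional "/", ">".
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^<>]*?)(/?)>")


def classify(markup):
    """Label every tag as start, end, or empty using syntax alone."""
    kinds = []
    for match in TAG.finditer(markup):
        closing, name, _attrs, self_closing = match.groups()
        if closing:
            kinds.append(("end", name))
        elif self_closing:
            kinds.append(("empty", name))
        else:
            kinds.append(("start", name))
    return kinds


SAMPLE = """<p secret="false">A paragraph with <IMG SRC='madonna.jpg'/> inside.</p>"""

print(classify(SAMPLE))
```

Without the trailing slash, the same code would report IMG as a start-tag and then wait for an end-tag that never comes.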

Entities

XML documents don't have to live in a single file; they can be made up
of multiple pieces, called entities. Example 5 illustrates entities in
the master document for a short book.

-------------------------------------------------------------------------
Example 5
<!doctype book SYSTEM "book.dtd"
[
<!entity toc SYSTEM "toc.xml">
<!entity chap1 SYSTEM "chapters/c1.xml">
<!entity chap2 SYSTEM "chapters/c2.xml">
]>
<book><head>&toc;</head>
<body>
&chap1;
&chap2;
</body></book>
-------------------------------------------------------------------------

In this case, the table of contents and chapters live in separate files
(well, really, those are URLs), so different people can work on them in
parallel. These are called external entities because their content is
outside the main document.

Entities can also be used for reusable text and for referring to
characters that are hard to type on the keyboard. Example 6 declares an
entity that expands to the text "Extensible Markup Language." It also
uses entities referring to characters that are different versions of the
number "1"; these don't need to be declared since they're just
numbers. These numbers come from the Unicode standard for international
character sets. (This is as good a place as any to let you know that XML
comes globalized: you can use any Unicode character in XML.)

-------------------------------------------------------------------------
Example 6
<!doctype eg
[
<!entity xml "Extensible Markup Language">
]>
<eg>The new &xml; standard is fully internationalized; the following
are all examples of the digit "1": &#49; (in ASCII),
&#x0661; (in Arabic), &#x0967; (in Devanagari), and
&#x0d67; (in Malayalam).</eg>
-------------------------------------------------------------------------
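
Here is a minimal sketch in Python (not from the original article) of a parser expanding both kinds of reference from Example 6. It uses the standard xml.etree.ElementTree and unicodedata modules and prints the Unicode name of each numbered character. One assumption to flag: the DOCTYPE and ENTITY keywords are written in uppercase here because parsers for the final XML 1.0 syntax expect that spelling, and the sentence is shortened.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Shortened form of Example 6, with uppercase declaration keywords.
DOC = """<!DOCTYPE eg [
<!ENTITY xml "Extensible Markup Language">
]>
<eg>The new &xml; standard: &#49; &#x0661; &#x0967; &#x0d67;</eg>"""

# The parser expands &xml; and the numeric references before we see them.
text = ET.fromstring(DOC).text

# The last four whitespace-separated tokens are the four digit characters.
digits = text.split()[-4:]
names = [unicodedata.name(ch) for ch in digits]
values = [unicodedata.digit(ch) for ch in digits]

print(text.split(":")[0])
for name, value in zip(names, values):
    print(name, "=", value)
```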

Being Well-Formed

Now we're ready for one of XML's most important concepts: that of the
well-formed document. This just means that:

1. All the tags are there.
2. The begin- and end-tags match (with the exception of empty
elements, which can use the /> trick to skip the end-tag).
3. All the attribute values are quoted.
4. All the entities are declared.
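
These rules are easy to test mechanically. Here is a minimal sketch in Python (not from the original article) that uses the standard xml.etree.ElementTree parser as the judge: if the parser raises an error, the text is not well-formed. The snippets and the is_well_formed name are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if a standard XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# One snippet per rule in the list above, plus one that passes.
CASES = {
    "all rules met": "<p>ok <br/> fine</p>",
    "rule 1, end-tag missing": "<p>no end tag",
    "rule 2, tags do not match": "<p><i>crossed</p></i>",
    "rule 3, value not quoted": "<img src=madonna.jpg />",
    "rule 4, entity not declared": "<p>line&nbsp;break</p>",
}

for label, snippet in CASES.items():
    print("%-28s %s" % (label, is_well_formed(snippet)))
```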

All the examples so far have been well-formed, but the following (lousy,
but usable) HTML document isn't:

<title>Reasonable HTML</title>
Some text, and I <i>really</i> don't want a line&nbsp;break
between "line" and "break".
<p>Here's a picture: <IMG src=madonna.jpg>

The problems with this are:

1. There's no "root" element to enclose the whole thing (should be
<HTML> ... </HTML>).
2. The entity nbsp is used without being declared.
3. There's a <p> with no corresponding </p>.
4. The <IMG> tag is missing the closing />, so the XML parser can't
tell it's supposed to be empty.
5. The value of the src attribute, madonna.jpg, should be quoted but
isn't.

There are a few more syntactic details to being well-formed, but these
are the important ones. Well-formed documents are easy to parse, even
for the tiniest applets.

XML IS GUARANTEED

Since browsers are so forgiving of bad HTML, there's a lot of bad HTML
out there, which makes it hard to do automated processing with any
reliability. (You could do it if you were to write as much code as there
is in Netscape Navigator, but you don't want to write that much code!)

Fortunately, XML comes with a built-in solution. The XML spec says, very
clearly, that if a document is supposed to be XML but isn't well-formed,
then it's toast. That is to say, no conformant XML processor is allowed
to recover, to go on and try to guess what the author meant. The idea
is, basically, that it's pretty easy to make documents well-formed, the
rewards for doing so are very high, and anybody who doesn't bother is a
bozo whose material should be ignored anyway.

This was a controversial decision, but it was one that both Netscape and
Microsoft demanded of the XML committee. HTML means never having to say
you're sorry, which is just fine for lightweight low-overhead
publishing, but a really lousy basis for trying to automate the Web.

This decision won't change the way people work -- authors will continue
to publish any old thing, no matter how bad, as long as it looks good in
Navigator -- but when you're publishing XML, XML's error-handling rules
will guarantee that once Navigator displays something, you can be sure
it's well-formed.

THE HISTORY OF XML

If you look under the covers, it turns out that XML is actually a
simplified form of SGML (Standard Generalized Markup Language). SGML is
a big, complicated ISO standard that has been used to define HTML and
lots of other languages.
While SGML is a useful tool in a lot of industrial applications, it's
just too complicated for Joe Homepage.

XML was cooked up by a combination of old publishing hacks who like SGML
and Web-heads (some with IPOs under their belts) who understand how the
Web works. Some of us are old publishing hacks and Web-heads at the same
time. Basically, XML is SGML with the hard bits thrown out, explained
simply and straightforwardly.

We started in July 1996 and published the first draft in November
1996. The first parser appeared in January 1997. The first applications
started bubbling up in March 1997, and now they're springing up like
mushrooms everywhere. Is this living in Internet Time or what?

THE FUTURE

XML will probably be an official World Wide Web Consortium (W3C)
recommendation by the end of 1997. It's already being applied in a
variety of ways, including:

1. RDF (Resource Description Framework) -- still under development in
the W3C, a framework for general-purpose Web metadata that's supported
by Netscape and a variety of other companies. (Metadata is information
about information: datestamps, security, subject headings, content
ratings, Web maps, copyright notices, and the like. Right now, the Web
doesn't have any metadata, but needs it terribly.) While the picture is
a little hazy, a couple of other XML-based proposals, Microsoft's CDF
(Channel Definition Format) and Marimba/Netscape's DRP (Distribution and
Replication Protocol), will probably fit into the RDF framework.
2. OFX (Open Financial Exchange) -- the format used by Intuit Quicken
and Microsoft Money to talk to banks.
3. CML (Chemical Markup Language) -- invented in Britain for chemists to
interchange descriptions of molecules, formulas, and other chemical
arcana.
4. MML (Mathematical Markup Language) -- more W3C work, designed to
support the unglamorous (but commercially important) job of typesetting
mathematics.
5. OSD (Open Software Distribution) from Marimba and Microsoft.

All this is very nice, but it's not what really turns the cranks of the
XML people. We're waiting for the day when Navigator will be able to
display XML natively, driven by a selection of style sheets, powered by
Java applets running in the browser, and fueled by rich, user-defined
document structures. That day isn't here yet, but it will be, sooner
than you think.

-------------------------------------------------------------------------
FURTHER RESOURCES

The XML Headquarters at the World Wide Web Consortium
The XML FAQ at the University College of Cork, in Ireland
The XML area of the SGML Web Page at the Summer Institute of Linguistics
-------------------------------------------------------------------------

Many thanks to Lauren Wood for support and sanity checks.

Tim Bray is a Canadian who has been working with computerized documents
since joining the Oxford English Dictionary project in 1986. He
co-founded Open Text Corporation, wrote one of the first big Web robots,
and since 1996 has had a consulting practice under the name
Textuality. He is a Seybold Fellow, edits The Gilbane Report, and is
co-editor of the W3C XML specification. Tim represents Netscape (on a
consulting basis) in the XML process, but does not speak for anyone but
himself, and nothing in this article should necessarily be taken as
representing Netscape's opinion.

--------------------------------------------------------------------------
Copyright 1997 Netscape Communications Corporation

----
adam@cs.caltech.edu

A man has a right to unrestricted liberty of discussion. Falsehood is a
scorpion that will sting itself to death.
-- Percy Bysshe Shelley, "A Declaration of Rights"