http://developer.netscape.com/news/viewsource/bray_xml.html
BEYOND HTML: XML AND AUTOMATED WEB PROCESSING
By Tim Bray
Send comments and questions about this article to View Source.
XML (Extensible Markup Language) was nowhere a year ago; now it seems to
be everywhere. It's supposed to be the thing that "goes beyond HTML" --
but what does that mean? Since HTML is the most successful document
format in history, why would anyone want to go beyond it? The people who
are working on XML talk about "automating the Web" -- what does that
mean? XML is designed to do some jobs that HTML isn't built to handle
but that really need doing. If you just want to display text, there's
nothing wrong with HTML, but for automated Web processing -- enriching
documents in a way that enables computer programs (like Web robots) to
do something with them -- what's needed is XML.
XML was designed under the auspices of the World Wide Web Consortium
(W3C). It went public in November 1996 and is already the basis for
half a dozen proposals to automate Web processing. XML has a lot of
people thinking really hard about what the future of the Web will look
like. You need to start thinking about XML now, because a year from now
you'll undoubtedly be using it a lot.
XML is extensible, easy, and (hard to believe, but true) guaranteed not
to break your computer programs. In this article, I'll expand on these
extravagant claims. Then I'll explain where XML came from and also take
some guesses as to where it's going and what it might mean for you.
XML IS EXTENSIBLE
Extensibility is the reason for XML. HTML is great, but often it can
seem to have either too many tags or not enough. It's got too many if
you're trying to write a browser, a robot, or a general-purpose
JavaScript utility, but not enough if you want to identify a
<Part-Number>, <Exchange-Rate>, or <Aikido-Rank> in your Web page to
allow automated processing. HTML doesn't have those tags, and it isn't
going to get them. But in XML, you can make up any old tags you want to
use.
Suppose I wanted to add some intelligence to the vast amount of e-mail
stored on my computer. With XML I could mark it as shown in Example 1.
-----------------------------------------------------------------------
Example 1
<email>
<head>
<from> <name>Tim Bray</name> <address>tbray@textuality.com</address> </from>
<to> <name>Paul Dreyfus</name> <address>pdreyfus@netscape.com</address> </to>
<subject> First draft of XML intro </subject>
</head>
<body>
<p>Here's a draft of that XML article. I'll be on the road but
connected to e-mail. Let me know if it hits the right level (i.e., are
major revisions in order?). If it's fine, proceed with editorial
nit-pickery. -Tim</p>
<attach encoding="mime" name="xml-draft.html"/>
</body>
</email>
-------------------------------------------------------------------------
This example should be pretty obvious. The <attach> tag looks a little
weird, but we'll cover that in a moment. Some of the advantages should
also be obvious. To start with, a Web robot could do a smart job of
indexing this, and a Java applet could do all sorts of intelligent
formatting (such as build a table-of-contents summary of a bunch of
e-mail). The basic idea here is called descriptive markup: the tags
around a chunk of text don't say how to format it, or what to do when
people click on it; they just say what it is. This is in dramatic
contrast to HTML, where the tags do all these things at once.
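To make the robot and applet idea concrete, here is a minimal Python sketch that parses a shortened copy of the Example 1 message with the standard library and pulls out the fields a table-of-contents summary would need. The function name `summarize` and the summary format are assumptions made for this illustration; nothing in XML prescribes them.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the message from Example 1.
EMAIL = """
<email>
<head>
<from> <name>Tim Bray</name> <address>tbray@textuality.com</address> </from>
<to> <name>Paul Dreyfus</name> <address>pdreyfus@netscape.com</address> </to>
<subject> First draft of XML intro </subject>
</head>
<body>
<p>Here's a draft of that XML article.</p>
<attach encoding="mime" name="xml-draft.html"/>
</body>
</email>
"""

def summarize(xml_text):
    """Build a one-line summary using only the descriptive tags."""
    email = ET.fromstring(xml_text)
    sender = email.findtext("head/from/name").strip()
    subject = email.findtext("head/subject").strip()
    attachments = [a.get("name") for a in email.iter("attach")]
    return f"{sender}: {subject} ({len(attachments)} attachment(s))"

print(summarize(EMAIL))
```

The program never needs to guess which text is the subject or the sender; the tags say so.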
The big win with descriptive markup is a bit subtle. Suppose you're
processing some e-mail and you want to be able to display it both with
Navigator on a big monitor and on the teeny screen of a cell phone. If
the e-mail were marked up in XML, you could write one set of rules for
the monitor and another for the cell phone, another to produce a
professional-quality paper printout, and still another to drive a fax
machine.
The idea is that you've decoupled the document from its
presentation. This doesn't make designing good documents or good
presentations easy, but it does mean that you can attack the problems
separately, which is a big step forward.
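As a rough sketch of that decoupling, the Python below renders one cut-down message two ways. The two "rule sets" are ordinary functions here, and their names, layouts, and the 20-character phone width are all assumptions for the sake of illustration; a real system would more likely use a stylesheet language.

```python
import xml.etree.ElementTree as ET

# A cut-down message; only the parts the renderers use are included.
MESSAGE = ET.fromstring(
    "<email><head><from><name>Tim Bray</name></from>"
    "<subject>First draft of XML intro</subject></head>"
    "<body><p>Here's a draft of that XML article.</p></body></email>"
)

def render_monitor(email):
    """Roomy layout: labelled header lines, then every paragraph."""
    lines = [
        "From:    " + email.findtext("head/from/name"),
        "Subject: " + email.findtext("head/subject"),
        "",
    ]
    lines += [p.text for p in email.iter("p")]
    return "\n".join(lines)

def render_phone(email, width=20):
    """Tiny screen: just the subject, cut to the available width."""
    return email.findtext("head/subject")[:width]

print(render_monitor(MESSAGE))
print(render_phone(MESSAGE))
```

Neither function changes the document; adding a printer or fax rendering would mean adding another function, not editing the e-mail.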
Publish and Constrain Your Tags
Obviously, you don't want to make up a new set of tags every time you
write a document. Furthermore, since this is the Web, you'd probably
like to share your work with others.
XML has something called a document type definition (usually called a
DTD) that allows you to define the tags you've created, for future use
by yourself or others. Example 2 is the DTD for the e-mail shown in the
earlier example.
------------------------------------------------------------------------
Example 2
<!element email (head, body)>
<!element head (from, to+, cc*, subject)>
<!element from (name?, address)>
<!element to (name?, address)>
<!element name (#PCDATA)>
<!element address (#PCDATA)>
<!element subject (#PCDATA)>
<!element body (p | attach)*>
<!element p (#PCDATA)>
<!element attach EMPTY>
<!attlist attach encoding (mime|binhex) "mime"
                 name CDATA #REQUIRED>
-------------------------------------------------------------------------
This should be easy to read, too. In English, it says:
1. An EMAIL has to have a HEAD and a BODY.
2. The HEAD has to have a FROM, one or more TOs, zero or more CCs, and
a SUBJECT.
3. The FROM and the TO can both include a NAME, and they have to
include an ADDRESS.
4. The NAME, ADDRESS, and SUBJECT are all just text.
5. The BODY is a mixture of Ps and ATTACHes.
6. A P contains just text.
7. An ATTACH doesn't contain anything, but it has an ENCODING
attribute whose value can be either mime or binhex; if it's not there,
the default is mime.
8. An ATTACH also has a NAME attribute whose value can be any text,
but has to be there.
I'm not going to explain all the details of the DTD syntax, but the
ideas are pretty obvious. Clearly, you'd normally have one DTD that
describes a lot of different documents; think of it as an SQL database
schema for documents.
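One way to see what the element declarations mean is to notice that each content model is essentially a regular expression over the names of an element's children. The Python sketch below rewrites part of Example 2 that way by hand and checks a document against it. It is only an illustration under that assumption: the `CONTENT_MODELS` table and the `conforms` function are made up for this sketch, attributes and text content are ignored, and a real validating parser does considerably more.

```python
import re
import xml.etree.ElementTree as ET

# Content models from Example 2, rewritten by hand as regular expressions
# over space-terminated child names. Elements not listed are not checked.
CONTENT_MODELS = {
    "email": r"head body ",
    "head": r"from (to )+(cc )*subject ",
    "from": r"(name )?address ",
    "to": r"(name )?address ",
    "body": r"(p |attach )*",
}

def conforms(element):
    """Check an element and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            return False
    return all(conforms(child) for child in element)

GOOD = ET.fromstring(
    "<email><head><from><address>tbray@textuality.com</address></from>"
    "<to><address>pdreyfus@netscape.com</address></to>"
    "<subject>Draft</subject></head><body><p>Hi</p></body></email>"
)
# Same message with the required <to> left out.
NO_RECIPIENT = ET.fromstring(
    "<email><head><from><address>tbray@textuality.com</address></from>"
    "<subject>Draft</subject></head><body><p>Hi</p></body></email>"
)

print(conforms(GOOD), conforms(NO_RECIPIENT))
```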
If this DTD were stored at some location -- say,
http://home.netscape.com/DTDs/email.dtd -- then to associate the DTD
with the e-mail message you'd insert a first line like this:
<!doctype email SYSTEM "http://home.netscape.com/DTDs/email.dtd">
<email>
<head>
<from> <name>Tim Bray</name> <address>tbray@textuality.com</address> </from>
<to> <name>Paul Dreyfus</name> <address>pdreyfus@netscape.com</address> </to>
...
The DTD might be useful to a program that received one of these
e-mail messages and wanted to find out in advance what tags would be in
it and how they fit together. But its most important use is to support
smart editing programs, which could read the DTD and simply not let the
author create a document that didn't match the DTD. (This isn't
imaginary; such authoring tools already exist.)
An XML document for which there is a DTD, and which conforms to that
DTD, is called valid. But a document doesn't have to be valid to be
useful, as we'll see in a moment.
Extensible Hyperlinks, Too
Adding your own tags is nice, but that's only part of what makes the Web
useful and XML interesting. Hyperlinks make the Web go; the <A
HREF="whatever"> idiom has become universal. XML extends Web
hyperlinks in a couple of useful directions. Example 3 is taken from a
description of a tournament game of Go (an old, complex, popular Asian
board game; Sakata is one of the most famous players of this
century).
-------------------------------------------------------------------------
Example 3
<P>Faced with a tight situation, Sakata found a
<X><L ROLE="EG" TITLE="English translation"
SHOW="NEW" HREF="/cgi-bin/xlate?term=tesuji" />
<L ROLE="ToMove" TITLE="Jump to move in game record"
SHOW="REPLACE" HREF="game.html#Move127" />
<L ROLE="PIC" TITLE="Illustration"
SHOW="EMBED"
HREF="pix.xml#DESCENDENT(1,FIG,CAPTION,TESUJI)" />
<L ROLE="CourseNotes" TITLE="Course Notes"
HREF="notes.xml#ID(def-Tesuji)..DITTO,NEXT(3,P)" />
tesuji</X>.</P>
--------------------------------------------------------------------------
Once again, we'll skip the syntactic details, which are explained in the
Linking part of the XML Specification. In a browser, this would look
something like:
Faced with a tight situation, Sakata found a tesuji.
When you clicked on "tesuji," though, instead of the usual Web behavior
of charging off after that link, you'd get a menu with four entries:
English translation, Jump to move in game record, Illustration, and
Course Notes.
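A browser would have to build that menu from the markup. The Python sketch below shows one plausible way to do it, reading the TITLE, SHOW, and HREF attributes of each L element inside the X element of Example 3. The function name `link_menu` is an assumption for this sketch, and it does nothing with the fragment syntax after the # in each URL.

```python
import xml.etree.ElementTree as ET

# The paragraph from Example 3.
PARAGRAPH = """
<P>Faced with a tight situation, Sakata found a
<X><L ROLE="EG" TITLE="English translation"
SHOW="NEW" HREF="/cgi-bin/xlate?term=tesuji" />
<L ROLE="ToMove" TITLE="Jump to move in game record"
SHOW="REPLACE" HREF="game.html#Move127" />
<L ROLE="PIC" TITLE="Illustration"
SHOW="EMBED"
HREF="pix.xml#DESCENDENT(1,FIG,CAPTION,TESUJI)" />
<L ROLE="CourseNotes" TITLE="Course Notes"
HREF="notes.xml#ID(def-Tesuji)..DITTO,NEXT(3,P)" />
tesuji</X>.</P>
"""

def link_menu(x_element):
    """List (title, show, href) for each L element inside the link."""
    return [
        (link.get("TITLE"), link.get("SHOW"), link.get("HREF"))
        for link in x_element.findall("L")
    ]

paragraph = ET.fromstring(PARAGRAPH)
menu = link_menu(paragraph.find("X"))
for title, show, href in menu:
    print(title, "->", href)
```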
Choosing English translation would run an ordinary CGI script. The
attribute SHOW="NEW" means that rather than replacing the current page,
the results of the script would show up in a new window (as if you'd
said TARGET=_NEW in an HTML page). By the way, the translation would
reveal that "tesuji" is a Go term meaning a clever tactical maneuver.
Jump to move in game record, a link into an HTML page, would behave
exactly as the Web does today.
The Illustration option is more interesting. First of all, it's a link
into an XML file. The text after the # in the URL says that the link is
to the first FIG element that has the attribute CAPTION="TESUJI". Also,
because of the attribute SHOW="EMBED", rather than replacing the current
page with the target of the link, that target material would be inserted
in the display right here at the location of the link.
The Course Notes option links to a "span" of text in an XML file --
specifically, the first three paragraphs following a tag that has the
attribute ID="def-Tesuji".
These straightforward extensions of the Web's current linking
facilities, in my opinion, add a lot of richness and cost very
little. (But then, I helped write the spec.)
XML IS EASY
Most standards, even popular ones, never get read by most people. How
many of us, for example, have actually read the basic HTML or TCP/IP
specs, or even the electrical standards that allow you to plug a toaster
in safely? The XML Specification, on the other hand, is short (less
than 40 pages), and since it was designed for use by programmers, most
readers of this article will find it straightforward.
The XML spec is available not only in HTML but also in RTF, PostScript,
and PDF versions. This was easy to arrange because the spec is actually
written in XML; all the other versions were auto-generated with a
variety of formatting systems. (Remember our discussion above about the
advantages of decoupling markup from a particular formatting semantic?)
For programmers, the HTML version is probably the most helpful. All the
special terms are linked to their definitions, and all the
"nonterminals" on the right-hand side of grammar productions are linked
to their definitions. If you want to sit down and read the spec
end-to-end, paper is the way to go.
XML had a design goal that it should be easy enough for a smart
programmer to whip up a parser in a week. Since it was announced in
November 1996, a ton of parsers have been whipped up. The one I wrote,
named Lark, took a bit more than a week, but I was traveling when I
wrote it. Besides, I didn't know Java when I started and had to learn it
as I went along. The Java class files for Lark are only about 40K, and
it does most of XML, with good error messages. This is simple stuff.
Elements, Tags, and Attributes
An XML document is made up of elements. Most elements have a start-tag,
which may contain attributes, and an end-tag. Example 4 illustrates the
XML terms element, element type, content, start-tag, end-tag, attribute
name, attribute value, and empty element.
-------------------------------------------------------------------------
Example 4
<p secret="false">This sentence is in the content of an
element whose type is "p"; the content is found between the
start-tag and the end-tag. The paragraph has an attribute named "secret"
whose value is "false". <IMG SRC='madonna.jpg'/> is an
empty element, distinguished by the fact that it ends with "/>".</p>
-------------------------------------------------------------------------
That's about all there is to it. The only thing that will look a little
weird to Web-folk is the <IMG> tag ending in />. (In HTML the <IMG> tag
just ends with > like any other tag.)
The /> is important. Since any HTML processor "just knows" that <IMG> is
an empty tag (one that doesn't depend on enclosing text and so doesn't
have an end tag), no special syntax is required. But since in XML you
can invent your own tags, empty elements (having no end-tags) need
special syntax to keep parsers from getting confused. The /> trick
allows simple programs to parse documents without knowing anything
about them in advance.
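Here is a small Python sketch of such a "simple program." It walks a shortened version of Example 4 using the standard-library parser, which has been told nothing about "p" or "IMG" in advance. The `describe` function and its output format are assumptions for this sketch; note also that this parser reports an element written as an empty-element tag the same way as one with a start-tag, an end-tag, and nothing between them.

```python
import xml.etree.ElementTree as ET

# A shortened version of Example 4.
SAMPLE = """<p secret="false">This sentence is in the content of an
element whose type is "p". <IMG SRC='madonna.jpg'/> is an
empty element.</p>"""

def describe(element, depth=0):
    """Yield one line per element: type, attributes, and whether it has no content."""
    empty = len(element) == 0 and not (element.text or "").strip()
    yield f"{'  ' * depth}{element.tag} {element.attrib} empty={empty}"
    for child in element:
        yield from describe(child, depth + 1)

lines = list(describe(ET.fromstring(SAMPLE)))
print("\n".join(lines))
```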
Entities
XML documents don't have to live in a single file; they can be made up
of multiple pieces, called entities. Example 5 illustrates entities in
the master document for a short book.
-------------------------------------------------------------------------
Example 5
<!doctype book SYSTEM "book.dtd"
[
<!entity toc SYSTEM "toc.xml">
<!entity chap1 SYSTEM "chapters/c1.xml">
<!entity chap2 SYSTEM "chapters/c2.xml">
]>
<book><head>&toc;</head>
<body>
&chap1;
&chap2;
</body></book>
-------------------------------------------------------------------------
In this case, the table of contents and chapters live in separate files
(well, really, those are URLs), so different people can work on them in
parallel. These are called external entities because their content is
outside the main document.
Entities can also be used for reusable text and for referring to
characters that are hard to type on the keyboard. Example 6 declares an
entity that expands to the text "Extensible Markup Language." It also
uses entities referring to characters that are different versions of the
number "1"; these don't need to be declared since they're just
numbers. These numbers come from the Unicode standard for international
character sets. (This is as good a place as any to let you know that XML
comes globalized: you can use any Unicode character in XML.)
-------------------------------------------------------------------------
Example 6
<!doctype eg
[
<!entity xml "Extensible Markup Language">
]>
<eg>The new &xml; standard is fully internationalized; the following
are all examples of the digit "1": &#49; (in ASCII),
&#x661; (in Arabic), &#x967; (in Devanagari), and
&#xd67; (in Malayalam).</eg>
-------------------------------------------------------------------------
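To check how a parser treats the two kinds of reference in Example 6, the Python sketch below feeds it a condensed version of that document and asks the Unicode database what each resulting character is. One assumption to flag: the sketch spells the declaration keywords in uppercase (`<!DOCTYPE`, `<!ENTITY`), as XML 1.0 in its final form requires and as this parser expects, whereas the examples in this article use lowercase.

```python
import unicodedata
import xml.etree.ElementTree as ET

# A condensed Example 6: one declared entity, four numeric character references.
DOC = """<!DOCTYPE eg [
<!ENTITY xml "Extensible Markup Language">
]>
<eg>The new &xml; standard: &#49; &#x661; &#x967; &#xd67;</eg>"""

text = ET.fromstring(DOC).text
digits = text.split(": ")[1].split()

print(text)
for ch in digits:
    # Each character is a different script's digit with the numeric value 1.
    print(unicodedata.name(ch), unicodedata.digit(ch))
```

The declared entity is replaced by its text, and the numeric references turn into characters without any declaration at all.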
Being Well-Formed
Now we're ready for one of XML's most important concepts: that of the
well-formed document. This just means that:
1. All the tags are there.
2. The begin- and end-tags match (with the exception of empty
elements, which can use the /> trick to skip the end-tag).
3. All the attribute values are quoted.
4. All the entities are declared.
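These rules are mechanical enough that a parser can check them without a DTD. As a minimal sketch, the Python below treats "the standard-library parser accepts it" as a stand-in for "well-formed"; the function name `is_well_formed` and the three test strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if a non-validating parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Quoted attribute value, empty-element tag, matching end-tag.
print(is_well_formed("<p>Here's a picture: <IMG src='madonna.jpg'/></p>"))
# Unquoted attribute value, no end-tags.
print(is_well_formed("<p>Here's a picture: <IMG src=madonna.jpg>"))
# Reference to an entity that was never declared.
print(is_well_formed("<p>line&nbsp;break</p>"))
```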
All the examples so far have been well-formed, but the following (lousy,
but usable) HTML document isn't:
<title>Reasonable HTML</title>
Some text, and I <i>really</i> don't want a line&nbsp;break
between "line" and "break".
<p>Here's a picture: <IMG src=madonna.jpg>
The problems with this are:
1. There's no "root" element to enclose the whole thing (should be
<html> ... </html>).
2. The entity nbsp is used without being declared.
3. There's a <p> with no corresponding </p>.
4. The