[BITS] YML - Why Markup Language...

I Find Karma (adam@cs.caltech.edu)
Mon, 23 Feb 1998 07:14:56 -0800


Roll Yer'own Markup Language: an Existential Experiment, or:

Why Markup Languages, Anyway?

A Gedankenexperiment by Rohit (and Adam)

======

We're interested in the possibilities for mucking with the concrete
syntax for representing XML documents on the wire and on disk. Remember,
even without getting into the esoterica of our later ideas, there's
something to be said about the wasted bytes of <VehicleModel>...
</VehicleModel> vs. <VehicleModel>...</>, because computers already know
how to balance braces. To say nothing of defining a dictionary with
VehicleModel=0x1f and using <0x1f>...</>. Or, if you'd like to
interleave XML with other arbitrary binary data streams, the binary data
would have to be escaped byte by byte in an XML marked section; there's no
generic hook for "the next X bytes are opaque; don't touch!" And finally, since
the requirement for SGML compatibility mandates a prologue with the
Doctype and head elements, two XML streams are not literally composable
or nestable one inside the other. The prologues and DTD namespaces of the
*entire* document have to be merged.
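
To make the </> idea concrete, here is a minimal Python sketch (ours,
for illustration only, not part of any spec) that expands the short end
tag back into ordinary XML with a stack. It assumes tags carry no
attributes and that </> is the only shorthand in play; the name
expand_short_end_tags is made up.

```python
import re

# Matches a start tag <Name>, the hypothetical short end tag </>, or a
# full end tag </Name>. Attributes and empty-element tags are left out
# on purpose to keep the sketch short.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)?>")

def expand_short_end_tags(text):
    """Rewrite every </> as the full end tag of the innermost open element."""
    out = []
    stack = []  # names of the currently open elements
    pos = 0
    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])  # copy character data unchanged
        pos = m.end()
        closing, name = m.group(1), m.group(2)
        if not closing:
            if name is None:
                raise ValueError("empty start tag <> is not supported")
            stack.append(name)
            out.append(m.group(0))
        else:
            if not stack:
                raise ValueError("end tag with no open element")
            open_name = stack.pop()
            if name is not None and name != open_name:
                raise ValueError(f"mismatched end tag </{name}>")
            out.append(f"</{open_name}>")
    out.append(text[pos:])
    if stack:
        raise ValueError("unclosed element: " + stack[-1])
    return "".join(out)
```

For example, expand_short_end_tags("<Car><VehicleModel>Civic</></>")
gives back "<Car><VehicleModel>Civic</VehicleModel></Car>", so the short
form loses nothing a parser could not recover.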

So ask:

- What if your XML didn't have to be ISO-SGML compliant?

- That is, how would you design a machine-readable,
machine-processible markup language if you knew nothing about SGML?

- Alternatively, what kind of markup would you create solely for
on-the-wire use? (i.e., the question of tree transformation in time)

Hypotheses:

- XML was written using the SGML DTD -- so the "26 page XML spec"
actually requires knowledge of the 500+ page SGML spec as well

- YML should be written in Backus-Naur Form

- YML should be inspired by the internal structures of DOM
(e.g., tree traversal operators)

The big question, of course, is: will YML be human-readable?
All YML documents must be transformable directly into XML,
but XML validation is not the same as YML validation (because
different algorithms would have to be used).

Perhaps a good place to start is Bert Bos' "Simple XML":

http://www.w3.org/XML/simple-XML.html

off his "XML notes" page:

http://www.w3.org/XML/notes.html

You won't get very far, though, because nothing else there is
publicly available. I'd love to learn more about his experiments with
writing a BNF for XML.

========================================================================

Four possible positions about concrete syntax issues in XML.

0. Concrete syntax should be application specific. Clearly a matter
of private understanding.

1. Let the compression layer deal with it -- compression is faster
than disks or networks are. Results from slim binaries @ UCI show
that this can be done quickly: you can decompress and *compile* a
smaller file off disk or off network *faster* than waiting to d/l the
native version.

2. We need a generic tokenized XML. A lame escaped encoding of XML.
Can do either a Huffman coding, or X atoms or Lisp-atoms. Define a
table 'here are the strings I will use in my document' - and use those
escapes instead of angle brackets. Machine parsable, but decidedly
unreadable. But compact -- even the dictionary for a common type could
be externalized through another dictionary URL.

3. Worry about issues of space-time layout of XML. How to represent
the tree structure, i.e. how to lay out chunks of the tree within the
file or stream. Important because of low latency issues. Depth
first (static, local, index), breadth first (online-progressive
rendering).

#1 and #2 are issues of degree. #1 keeps the layer separation much
cleaner.
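
A rough way to see position #1 at work: run verbose, repetitive markup
through a stock compressor and compare sizes. This is only a sketch
using Python's zlib; the sample document and the helper name
roundtrip_sizes are invented for illustration, and it says nothing
about speed.

```python
import zlib

def roundtrip_sizes(xml_text, level=9):
    """Compress the markup, check it round-trips, return (raw, packed) sizes."""
    raw = xml_text.encode("utf-8")
    packed = zlib.compress(raw, level)
    # The markup layer never sees the compression: bytes in, same bytes out.
    assert zlib.decompress(packed) == raw
    return len(raw), len(packed)

# Invented sample: the same long tag name repeated many times.
sample = "<Fleet>" + "<VehicleModel>Civic</VehicleModel>" * 200 + "</Fleet>"
raw_size, packed_size = roundtrip_sizes(sample)
```

The long tag names are exactly the kind of redundancy a general-purpose
compressor soaks up, which is why #1 can leave the XML syntax alone.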

Then the question is #3, what kinds of layout do we want to do? High
level: we don't need to fix this in the encoding - we can already use
the XLL link model to do layout within a transaction. The HTTP transaction
can pre-push the subsidiary chunks in the most useful
(application-specific) ordering.
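
The two layouts named in #3 are just two traversal orders over the same
tree. A small sketch with Python's xml.etree.ElementTree (the function
names and the toy document are ours):

```python
from collections import deque
import xml.etree.ElementTree as ET

def depth_first(root):
    """Document order: each subtree is emitted whole before its siblings."""
    return [el.tag for el in root.iter()]

def breadth_first(root):
    """Level order: every shallow node arrives before any deeper detail."""
    order, queue = [], deque([root])
    while queue:
        el = queue.popleft()
        order.append(el.tag)
        queue.extend(el)  # iterating an Element yields its direct children
    return order

doc = ET.fromstring("<a><b><d/></b><c><e/></c></a>")
```

Here depth_first(doc) yields a, b, d, c, e while breadth_first(doc)
yields a, b, c, d, e. A sender that pre-pushes chunks can pick either
order (or an application-specific one) without touching the encoding.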

Argument in favor of #2 -- need a machine-proof encoding format,
one that doesn't have to deal with ambiguities of whitespace, etc. How
to make the thing in core have less impedance mismatch with the thing on
disk? Tradeoff: the goal of a truly archival format is weakened, perhaps
fatally.
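
Here is one way the tokenized form in #2 might look, as a hedged Python
sketch rather than a proposal: a table of tag names, one-byte codes in
place of angle brackets, and length-prefixed text so the reader never
scans for delimiters. The opcode values, the 4-byte length prefix, and
the encode/decode names are all assumptions of ours; attributes are
ignored.

```python
import struct
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

END, TEXT = 0x00, 0x01  # reserved opcodes; tag codes start at 0x02

def encode(root):
    """Return (table, payload): the tag-name table and the token stream."""
    table = []
    out = bytearray()

    def code_for(tag):
        if tag not in table:
            table.append(tag)
        return table.index(tag) + 2  # one byte, so at most 254 distinct tags

    def emit_text(s):
        if s:
            data = s.encode("utf-8")
            out.append(TEXT)
            out.extend(struct.pack(">I", len(data)))  # next N bytes are opaque
            out.extend(data)

    def walk(el):
        out.append(code_for(el.tag))  # attributes are dropped in this sketch
        emit_text(el.text)
        for child in el:
            walk(child)
            emit_text(child.tail)
        out.append(END)

    walk(root)
    return table, bytes(out)

def decode(table, payload):
    """Rebuild ordinary XML text from the table and the token stream."""
    parts, stack, i = [], [], 0
    while i < len(payload):
        op = payload[i]
        i += 1
        if op == END:
            parts.append(f"</{stack.pop()}>")
        elif op == TEXT:
            (n,) = struct.unpack_from(">I", payload, i)
            i += 4
            parts.append(escape(payload[i:i + n].decode("utf-8")))
            i += n
        else:
            name = table[op - 2]
            stack.append(name)
            parts.append(f"<{name}>")
    return "".join(parts)
```

Usage: table, payload = encode(ET.fromstring(xml)), then
decode(table, payload) gives the XML text back. The payload is
unreadable by eye, which is the archival tradeoff mentioned above, and
the table is the piece that could live behind a dictionary URL.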

In yet other scenarios, XML may not be adopted unless it has a compact
form. No room for a full gzip stack inside a light bulb, or some other
ROM-based thing.

HDML now has a compact encoding... will have to look into it more.

Arbitrary deferment - ANY chunk of XML can be taken offline by replacing
it with a reference. For example, in a form with a set of attributes
for choices -- could make the list of choices a separate resource.
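
A sketch of what deferment could look like mechanically, again in
Python with xml.etree.ElementTree. The <ref href="..."/> placeholder,
the store dictionary standing in for "somewhere else on the net", and
the name defer are all hypothetical; a real design would presumably
lean on the XLL link model instead.

```python
import xml.etree.ElementTree as ET

def defer(parent, child, href, store):
    """Move `child` out of `parent` into `store`, leaving a reference behind."""
    tail, child.tail = child.tail, None  # keep trailing text with the parent
    store[href] = ET.tostring(child, encoding="unicode")
    index = list(parent).index(child)
    ref = ET.Element("ref", {"href": href})  # hypothetical placeholder element
    ref.tail = tail
    parent.remove(child)
    parent.insert(index, ref)

form = ET.fromstring(
    "<form><choices><choice>red</choice><choice>blue</choice></choices></form>"
)
store = {}  # stands in for separately fetched resources
defer(form, form.find("choices"), "choices.xml", store)
```

After the call, the form carries only a small reference element and
the list of choices sits in store["choices.xml"] as its own chunk, to
be fetched (or pre-pushed) whenever the application likes.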

Some of our other concerns are no longer urgent since the XML Namespaces
work is moving forward swiftly.

Composability is still a tough goal with the SGML heritage and
namespaces up in the air. Our hope is:

any fragment should be a complete document, and
any document should be a complete fragment

Just Sartre passing the time, and the bottle,
Rohit Khare
Adam Rifkin

----
adam@cs.caltech.edu

Talking about music is like dancing about architecture.
-- Laurie Anderson