Baked XML With Sour Grapes and ArChives [Draft -1 !!!]

Rohit Khare (khare@mci.net)
Wed, 25 Jun 1997 01:30:19 -0400


Baked XML With Sour Grapes and ArChives
Rohit Khare, 6/25/97 * brought to you by $3.41 worth of chai * 1200
words in 50 minutes

Dan Connolly suggested codifying the argument I have bandied about as a
pet theory for a few years now: XML is an ideal substrate for archiving
the state of distributed systems. In this case, I mean distributed in
the sense of 'across organizational boundaries' more than I mean 'across
address spaces' (though there's some of that, too).
Let me draw a picture, then we'll go back over the theory (as meager as
it is). Let's recapitulate the most basic case: archiving a network of
dependent objects within one application. Suppose we have a human
resources application with Employee, Department, and Manager::Employee
classes and assorted instances thereof. Let's consider what happens, at
first without considering HTML/XML at all.
As a zeroth attempt, you might 'pickle' a Department by simply writing
down its data structure in memory. Immediately, though, we see that
there are pointers to a Department's Manager, so we actually need to
package both objects together to make a complete statement. If you
wanted to get all the Employees within a department, too, then you
discover the general case. Since there is a web of employees, all
related to each other because some are managers, we cannot simply write
down the Department, the Manager, and then each Employee reporting to
that manager: we could chase the pointers around a cycle forever. In
fact, the general case is mark-and-sweep: first you trace out every
object connected to the one at hand; then you collect and serialize
every object in that subset. [@@expn could be clarified through
superclass write: methods instead (you may know what friends you need
pickled, but not what your superclass implementation might be using).]
This is expensive! Before you can even put the first byte on the wire,
you have to plan out the entire pickle. It's fragile, too, because two
successive snapshots may yield different values for some subsets of the
archive (some reporting roles change, etc.), i.e. two archives end up
holding conflicting copies of the state of one object.
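A minimal sketch of that trace-then-serialize pickle, in Python. The
Employee class, its fields, and the JSON output format are all
assumptions made for illustration; the point is only that nothing can be
written until the whole trace has finished.

```python
import json

class Employee:
    def __init__(self, name):
        self.name = name
        self.manager = None   # pointer to another Employee (may form cycles)
        self.reports = []     # pointers back down the hierarchy

def trace(root):
    """Mark phase: collect every object reachable from root exactly once."""
    seen = {}                 # id(obj) -> obj; this is what guards against cycles
    stack = [root]
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen[id(obj)] = obj
        if obj.manager is not None:
            stack.append(obj.manager)
        stack.extend(obj.reports)
    return list(seen.values())

def pickle_graph(root):
    """Sweep phase: pointers become indexes into the traced set."""
    objs = trace(root)        # the entire graph must be known up front
    index = {id(o): i for i, o in enumerate(objs)}
    records = [
        {
            "name": o.name,
            "manager": index[id(o.manager)] if o.manager is not None else None,
            "reports": [index[id(r)] for r in o.reports],
        }
        for o in objs
    ]
    return json.dumps(records)

boss = Employee("Manager")
worker = Employee("Employee")
worker.manager = boss
boss.reports.append(worker)   # cycle: boss -> worker -> boss
archive = pickle_graph(boss)
```

Without the `seen` table, the boss/worker cycle would keep the traversal
going forever, which is the trap described above.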
Now, wait: you actually know a radically different way of solving this
problem, and it's right under your nose: transferring a Web page! A page
has many subsidiary resources, some of which load other subparts in
turn, some of which are shared with other pages, and so on. But WE don't
have to pickle: HTTP servers don't grovel over home pages and send out a
neatly packaged bundle of
html-with-all-embedded-images-and-sounds-in-one-MIME-multipart. We have
the miracle of names! Instead of placing expensive marshalling burdens
on the server (writer), we just send over the one object at hand, with
names as pointers to other resources. Then we let the client (reader)
pull what it needs to build a complete map... the delay bottleneck goes
away, so we can really stream these puppies asap. [Now, of course, as a
performance optimization, we can pipeline the next few employee records
you'll need, to mask the underlying latency... cache push]
Lesson 1: URLs are an excellent way to capture distributed state.
Versioning, security, etc., can be layered on top of that mechanism.
There are several red-herring issues: space used by long URLs (compress
the transport), fragility of locations (bzzt! Locations == names).
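To make the names-as-pointers idea concrete, here is a rough Python
sketch. The WEB table stands in for an HTTP server so the example runs
without a network; the hr.example.com URLs, the field names, and the use
of JSON as the document format are all hypothetical.

```python
import json

# Hypothetical in-memory stand-in for an HTTP server: URL -> document.
WEB = {
    "http://hr.example.com/dept/sales": {
        "name": "Sales",
        "manager": "http://hr.example.com/emp/1",
    },
    "http://hr.example.com/emp/1": {
        "name": "Manager",
        # Points back at the department: a cycle, but harmless to the writer.
        "department": "http://hr.example.com/dept/sales",
    },
}

def get(url):
    """Writer side: return just the one object at hand; pointers stay as names."""
    return json.dumps(WEB[url])

def crawl(url, wanted, cache=None):
    """Reader side: follow only the links named in `wanted`, pulling on demand."""
    cache = {} if cache is None else cache
    if url in cache:          # already fetched; also stops cycles
        return cache
    doc = json.loads(get(url))
    cache[url] = doc
    for field in wanted:
        target = doc.get(field)
        # Crude link test for the sketch: a string that names a known resource.
        if isinstance(target, str) and target in WEB:
            crawl(target, wanted, cache)
    return cache

first = get("http://hr.example.com/dept/sales")   # available immediately
full = crawl("http://hr.example.com/dept/sales", ["manager", "department"])
```

The writer never traces anything: `get` does one lookup and is done. All
of the cycle handling has moved to the reader, which only pays for the
links it chooses to follow.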
Now, let's look at what you put in those pickles anyway. In traditional
speak, there are "primitives": number, string, float, char, etc. This is
enshrined in IDL, and in many earlier systems as well. It is even
enshrined on the wire (e.g. XDR). So you end up with binary soup. Now,
you and I, we exchange structured data all the time. Purchase orders,
parking tickets, business cards: i.e. DOCUMENTS. The information in a
data structure is trivially equivalent to a document, but also
equivalent at a more profound level. I could make a document
"@*OU9.RohitKhar3.MCI10.6179605131@"... note the length fields: in a
binary stream, heck, you can't even rely on the length of a length field
-- ASN.1 inanity highlights this (in fact, it should be introduced
earlier) -- it has length-of-length and so on. But the document
"<H1.name>Rohit Khare</H1>", using style sheets, conveys the same
information in a human-readable form.
Lesson 2: Documents are an extremely convenient way of encoding object
state in a way that's usable by both humans and computers, and that is
more palatable, more reusable, and less fragile than binary formats.
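A small Python sketch of the contrast. The length-prefixed format here
('<decimal length>.<value>' per field) is a simplification I am assuming
for illustration, loosely modeled on the string above rather than on any
real wire format, and the tagged form skips escaping of '<' and '&'
entirely.

```python
import re

def encode_lengths(fields):
    """Length-prefixed encoding: '<decimal length>.<value>' for each field."""
    return "".join(f"{len(v)}.{v}" for v in fields)

def decode_lengths(data):
    """The reader must already know the field order; nothing names the fields."""
    fields, pos = [], 0
    while pos < len(data):
        dot = data.index(".", pos)          # end of the length field
        n = int(data[pos:dot])
        fields.append(data[dot + 1 : dot + 1 + n])
        pos = dot + 1 + n
    return fields

def encode_tagged(record):
    """Tagged encoding: every value carries its own name."""
    return "".join(f"<{k}>{v}</{k}>" for k, v in record.items())

def decode_tagged(doc):
    """Recover name/value pairs without knowing the field order in advance."""
    return dict(re.findall(r"<(\w+)>(.*?)</\1>", doc))

packed = encode_lengths(["Rohit Khare", "MCI", "6179605131"])
tagged = encode_tagged({"name": "Rohit Khare", "company": "MCI"})
```

Dropping or reordering a field in `packed` silently shifts every value
that follows it, while `tagged` still decodes to the right names.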
Now, what if that document could carry the MEANING of name, company,
etc., instead of mere formatting hints? That is to say, what if the
convention for exchanging these virtual WebCards was not "the name will
be near the front in an H1", but rather "it will be in a <NAME> </NAME>
element"? Well, we can't add new tags to HTML: 1) it's incompatible
(i.e. mechanically SGML is broken so that new elements are not generally
possible), 2) it's impossible to document how-to-render (display or not?
What font? Break-before? Break-after? Mutex with some other element?),
and 3) the semantics are ambiguous (whose NAME? First-last or last,
first, etc.? multiculturalism?). Well, that's what DTDs are for (insert
"word means what I wish it to mean" quote). And the beauty of *XML* is
that I don't need to compile new programs to process each DTD: I really
CAN dynamically learn about new document types. XML fixes bugs in SGML.
XML adds a real naming scheme for DTDs.
Lesson 3: documents with self-describing grammars are far more reusable
than ad-hoc encodings. (i.e. motivation of *MLs)
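As a rough illustration of processing a new document type without
writing new code for it, here is a Python sketch using the standard
library's xml.etree.ElementTree. The WEBCARD vocabulary is made up for
the example, and note that ElementTree checks only well-formedness; it
does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical WebCard instance; the element names are illustrative only.
WEBCARD = """
<WEBCARD>
  <NAME>Rohit Khare</NAME>
  <COMPANY>MCI</COMPANY>
</WEBCARD>
"""

def to_dict(element):
    """Generic walk: element names come from the document, not from the code."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    # Sketch-level simplification: repeated child names overwrite each other.
    return {child.tag: to_dict(child) for child in children}

card = to_dict(ET.fromstring(WEBCARD))
```

The same `to_dict` would walk a purchase order or a parking ticket
unchanged, since it learns the vocabulary from the document it is
handed.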
But now, we have only defined our own precise meaning: nothing lets us
share WebCard semantics yet. What we need is a way to equate my vCard
with your WebCard: a filter between XML DTDs. Traditionally, we know
filters as converting between encoding types (e.g. jpg to png to
asciiart). DTD calculus lets us *coordinate across administrative
domains* (each having its own worldview). This brings synergy out of
ad-hocracy. Instead of being held prisoner by some industrywide
megaproject to define 'employee' and 'purchase order', like the OMG is
doing, we can let the definitions emerge organically.
Lesson 4: by declaring what we mean instead of encoding it operationally
(in programs, like Java), we can deterministically transform data with
high fidelity.
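One way such a filter might look, sketched in Python: the correspondence
between the two vocabularies is declared as a table, and a generic
routine applies it. The element names on both sides are hypothetical and
are not taken from any real vCard or WebCard DTD; a real filter would
also have to cope with structural differences, not just renaming.

```python
import xml.etree.ElementTree as ET

# Hypothetical correspondence between two card vocabularies.
VCARD_TO_WEBCARD = {"vcard": "WEBCARD", "fn": "NAME", "org": "COMPANY"}

def translate(element, mapping):
    """Rebuild the tree, renaming each element according to the table."""
    out = ET.Element(mapping.get(element.tag, element.tag), dict(element.attrib))
    out.text = element.text
    out.tail = element.tail
    for child in element:
        out.append(translate(child, mapping))
    return out

source = ET.fromstring("<vcard><fn>Rohit Khare</fn><org>MCI</org></vcard>")
result = ET.tostring(translate(source, VCARD_TO_WEBCARD), encoding="unicode")

# Because the mapping is data, the reverse filter comes for free.
WEBCARD_TO_VCARD = {v: k for k, v in VCARD_TO_WEBCARD.items()}
back = ET.tostring(
    translate(ET.fromstring(result), WEBCARD_TO_VCARD), encoding="unicode"
)
```

Since the table is a declaration rather than a program, inverting it or
composing it with another table is mechanical, which is the sense in
which the transformation is deterministic.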
Look, Documents are Archived Data. We create them by pickling programs
(the output of CGIs), and we can extract pickles back out through
deterministic reverse-engineering (webMethods' package tracker). Most of
all, the evolutionary advantage is that *combo* human/machine-readable
documents will be most powerful at focusing attention. People will
invest more to make the 'purchase order' forms look good, collect the
right data, and generally sweat the details, which they won't for
relational DB table schemas.
Lesson 5: human-readable documents form an excellent common ground for
the many ways to generate and extract data of that kind. There are
millions of employee databases in gadzillions of languages, but a
single, pretty common framework for business cards (admittedly with a
thousand variations in the graphic details -- but that's an XML DTD with
many CSS sheets).
So, the best way to archive the state of a distributed computation, like
the business of an entire corporation, is its intranet web. And the best
way to pickle the world is the Web. The best content for it is XML. The
pointers we get with XML are also more powerful and finer-grained, so
they are better marshalling tools. This is a protocol issue, too:
caching of object state becomes a visible, soluble issue (it's swept
under the rug in RPC/CORBA/DCOM systems instead).
<theoretical discussion lining up XML against classical OOP
serialization paper's taxonomy>