XML tag reader for Mithril (was Re: The defn XML series)

David McCusker (davidmc@netscape.com)
Fri, 10 Apr 1998 15:22:16 -0700


Mark Baker wrote: [ responding to Dirk Riehle ] [ snip ]
> (Ooooh, noooo, not another XML thread! Way to hit-and-run Rifkin. 8-)

Maybe someone would like to help me brainstorm about data structures
for representing XML parse trees. That might be more fun than having
me express opinions about how XML might actually be applied in use. :-)
The substance of my question today concerns how many different kinds of
tag types there are in XML and XML doc definition syntax. For example,
how many chars other than '?' and '!' follow '<' with special meaning?

I'm sure I could infer this information quickly by reading tons of
available XML documents and drafts. But perhaps someone has already
distilled this information. And folks might be interested in the manner
in which I might use this knowledge in a language, as mentioned below.
Perhaps the general idea will suggest other useful notions to folks.

I have today off, so I'm thinking about expression readers for my
nascent Mithril language, and a fellow named Luther Huffman is making
me think about how Mithril might be used with DSSSL the way that Scheme
is currently being used. (Because Mithril will start out as a hybrid of
both Smalltalk and Scheme in a lisp variant syntax, before it's used to
bootstrap other high level languages, or simply different syntaxes.)

I've been planning to allow a Mithril source code input stream to contain
any number of syntaxes, as long as an appropriate XML style tag is used
to delineate various passages. So the initial simple lisp syntax might
be enclosed in tags like <mithril:lisp:syntax>...</mithril:lisp:syntax>.
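
A minimal sketch of that idea, written in Python only because Mithril
syntax is not shown here. The names PASSAGE, READERS, split_passages and
read_all are hypothetical, and the "reader" is a stand-in that merely
splits on whitespace. It assumes passages with the same tag name are not
nested inside one another.

```python
import re

# Hypothetical pattern: <name>...</name>, where the closing tag must
# repeat the opening name exactly (colons are allowed in the name).
PASSAGE = re.compile(r"<([A-Za-z_][\w:.-]*)>(.*?)</\1>", re.DOTALL)


def split_passages(source):
    """Return a list of (syntax_name, body) pairs in source order."""
    return [(m.group(1), m.group(2)) for m in PASSAGE.finditer(source)]


# Hypothetical registry mapping a syntax tag to a reader function.
READERS = {
    "mithril:lisp:syntax": lambda body: body.split(),  # stand-in tokenizer
}


def read_all(source):
    """Hand each tagged passage to the reader registered for its syntax."""
    results = []
    for name, body in split_passages(source):
        reader = READERS.get(name)
        if reader is None:
            raise ValueError("no reader registered for syntax " + name)
        results.append((name, reader(body)))
    return results
```

Supporting another syntax would then mean registering one more reader
under its tag name, without touching the code that splits the stream.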

However, I also want to parse XML trees directly into data structures
that can be manipulated by Mithril programs, so tags will have the same
first class status as pairs which compose lists. Of course they could
be encoded as lists with appropriate magic symbols, but I'd rather have
a distinctly typed object of type "tag" which is clearly not a "pair".

What concerns me is capturing the special characters used in tags, like
'!' and '?', which are not considered simply part of a tag's name, and
encoding the tag type appropriately without including such characters
physically embedded in tag names.
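
One way this separation could look, as a rough Python sketch. The
delimiters it recognizes come from XML 1.0: '<?' opens a processing
instruction (and the XML declaration), '<!--' a comment, '<![CDATA[' a
CDATA section, '<!' followed by a keyword such as DOCTYPE, ELEMENT,
ATTLIST, ENTITY or NOTATION a declaration, '</' an end tag, and a
trailing '/>' marks an empty-element tag. The TagKind enum and the
classify function are hypothetical names, and conditional sections in
DTDs are not handled.

```python
from enum import Enum


class TagKind(Enum):
    START = "start"
    END = "end"
    EMPTY = "empty"
    PROCESSING_INSTRUCTION = "pi"
    COMMENT = "comment"
    CDATA = "cdata"
    DECLARATION = "declaration"


def classify(markup):
    """Return (kind, name) for one complete piece of markup.

    The special characters select the kind and are stripped, so they
    never end up embedded in the returned name.
    """
    if not (markup.startswith("<") and markup.endswith(">")):
        raise ValueError("not markup: " + markup)
    if markup.startswith("<!--"):
        return TagKind.COMMENT, None
    if markup.startswith("<![CDATA["):
        return TagKind.CDATA, None
    inner = markup[1:-1]
    if inner.startswith("?"):
        kind, rest = TagKind.PROCESSING_INSTRUCTION, inner[1:].rstrip("?")
    elif inner.startswith("!"):
        kind, rest = TagKind.DECLARATION, inner[1:]
    elif inner.startswith("/"):
        kind, rest = TagKind.END, inner[1:]
    elif inner.endswith("/"):
        kind, rest = TagKind.EMPTY, inner[:-1]
    else:
        kind, rest = TagKind.START, inner
    parts = rest.split(None, 1)  # the name ends at the first whitespace
    return kind, (parts[0] if parts else None)
```

For example, classify('<!DOCTYPE doc SYSTEM "doc.dtd">') yields the
DECLARATION kind with the name 'DOCTYPE', and classify('</doc>') yields
the END kind with the name 'doc'.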

I was thinking of encoding a tag in the same space used by a pair, with
the head slot pointing to the tag's name, and the tail slot pointing to
the possibly empty list or vector of tag attributes (each represented as
a pair associating name with value). I have a bunch of bits left over
in the slot preceding the tag that marks this object as a tag, so I can
encode many different kinds of specific tag types, as long as the set of
tag types is not open-ended. (Note that compilation can change the formats used.)
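
The layout described above can be sketched roughly as follows, again in
Python for illustration. Everything concrete here is an assumption: the
4-bit split of the header word, the TYPE_* and KIND_* constants, and the
helper names are all hypothetical. The point is only that a tag and a
pair share one cell shape and differ in the header bits.

```python
# Hypothetical header layout: low 4 bits = object type, bits above = tag kind.
TYPE_BITS = 4
TYPE_MASK = (1 << TYPE_BITS) - 1
TYPE_PAIR = 1
TYPE_TAG = 2

# A closed (not open-ended) set of tag kinds, so it fits in a few bits.
KIND_START, KIND_END, KIND_EMPTY, KIND_PI, KIND_DECL = range(5)


class Cell:
    """One header word plus head and tail slots; used for pairs and tags."""
    __slots__ = ("header", "head", "tail")

    def __init__(self, header, head, tail):
        self.header = header
        self.head = head
        self.tail = tail


def make_pair(head, tail):
    return Cell(TYPE_PAIR, head, tail)


def make_tag(kind, name, attributes=()):
    """Head holds the name; tail holds a list of (name . value) pairs."""
    attrs = None  # None plays the role of the empty list
    for key, value in reversed(list(attributes)):
        attrs = make_pair(make_pair(key, value), attrs)
    return Cell(TYPE_TAG | (kind << TYPE_BITS), name, attrs)


def is_tag(cell):
    return (cell.header & TYPE_MASK) == TYPE_TAG


def tag_kind(cell):
    return cell.header >> TYPE_BITS


def tag_attributes(cell):
    """Walk the tail list and return the attributes as Python tuples."""
    out, node = [], cell.tail
    while node is not None:
        out.append((node.head.head, node.head.tail))
        node = node.tail
    return out
```

With this shape, make_tag(KIND_PI, "xml", [("version", "1.0")]) stores
only "xml" as the name; the '?' survives solely as the kind in the
header, and is_tag tells the cell apart from an ordinary pair.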

I'm sorry if this level of implementation detail is a crashing bore to
some folks. (But I confess that doesn't really bother me either. :-) I'm
interested in what kinds of semantic and ontological information folks
would like to capture about XML structures, so code can be written to
manipulate such details directly using a high level coding language.

David McCusker, looking for the sweet spot where ontology meets the road
Values have meaning only against the context of a set of relationships.