[WWW10 Highlights] Compressed and binary XML
Wed, 23 May 2001 17:22:02 -0400
On 22 May 2001, at 15:49, Jim Whitehead wrote:
> This kind of speedup is only possible with an XML compression
> that understands XML, as opposed to something generic like zip.
At the risk of sounding like a sales guy (which I am proud to say I
am not), we have a freeware product called XMLZip that might
interest you all. XMLZip is a free Java application designed to
compress XML data. You can download a copy at:
Compressing the entire XML file lowers storage and processing
requirements, but it also renders the DOM API inaccessible (since
the file must be decompressed prior to being parsed). XMLZip
reduces the size of XML files while retaining the accessibility of the
DOM API (since you only decompress the portions of the file you
wish to parse). XMLZip also supports selective compression and
decompression: the user chooses, at compression time, the DOM
level at which subtrees are compressed.
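
To make the "decompress only what you parse" idea concrete, here is a
rough sketch in Python (XMLZip itself is Java; this is not its code or
its file format). It assumes each fragment lives in its own member of an
ordinary ZIP archive; the member names and sample XML are made up for
illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical fragments; each becomes its own compressed archive member.
fragments = {
    "orders.xml": "<orders><order id='1'/><order id='2'/></orders>",
    "customers.xml": "<customers><customer name='Ann'/></customers>",
}

# Write every fragment as a separate deflated member of an in-memory ZIP.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, xml_text in fragments.items():
        zf.writestr(name, xml_text)

# Later: inflate and parse only the member we need. The other member
# stays compressed and is never touched.
with zipfile.ZipFile(archive) as zf:
    orders = ET.fromstring(zf.read("orders.xml"))

order_ids = [order.get("id") for order in orders]
```

Because ZIP members are compressed independently, reading one member
does not require inflating the rest, which is what keeps partial DOM
access cheap.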
XMLZip files are constructed based upon a level parameter. The
level parameter specifies how many nested element levels down from
the document root to go before compressing subtrees. For
example, a level 2 XMLZip file has each child of the document root
replaced with a reference to the compressed subtree (I have
attached an uncompressed XML file and the associated XMLZip file
as an example).
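
For readers without the attachment, the level rule can be sketched
roughly as follows. This is an illustration in Python, not XMLZip's
actual code. It assumes the document root counts as level 1 (so level 2
means "children of the root"), that placeholders are empty <xmlzip>
elements, and that their attribute is spelled "id" with values numbered
from 1. The function name and sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def split_at_level(root, level):
    """Replace every element at depth `level` (root = 1) with an empty
    <xmlzip id="N"/> placeholder; return the removed subtrees in order."""
    if level < 2:
        raise ValueError("level must be at least 2 in this sketch")
    removed = []

    def walk(parent, depth):
        # Children of `parent` sit at depth + 1.
        for i, child in enumerate(list(parent)):
            if depth + 1 == level:
                placeholder = ET.Element("xmlzip", {"id": str(len(removed) + 1)})
                # Keep any text that followed the subtree in the prefix.
                placeholder.tail = child.tail
                child.tail = None
                parent[i] = placeholder
                removed.append(child)
            else:
                walk(child, depth + 1)

    walk(root, 1)
    return removed

doc = ET.fromstring(
    "<catalog><book><title>A</title></book><book><title>B</title></book></catalog>"
)
subtrees = split_at_level(doc, 2)          # the two <book> subtrees
prefix = ET.tostring(doc, encoding="unicode")  # root plus two placeholders
```

After the call, `doc` holds only the root and two placeholders, and
`subtrees` holds the removed <book> elements that would be compressed
separately.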
An XMLZip file is a ZIP format file. The first entry in the file is the
XML document tree prefix. The tree prefix is the original XML file
with each selected subtree replaced by a single, empty <xmlzip> tag. The
subtrees that are removed are all the sibling subtrees at a
particular level. The first entry is denoted in the XMLZip file as a ZIP
entry with a name ending in ".0".
The last entry in the XMLZip file is the index. It lists all the
compressed XMLZip file entries and the ID attribute of the
corresponding <xmlzip> tag. The index entry in the XMLZip file has
a name ending in ".i".
Both the tree prefix (the .0 entry) and the index (the .i entry) are
stored uncompressed for quick extraction.
The remaining entries in the XMLZip file are the compressed
subtree fragments. Each entry has a name that ends in a dot
followed by a number greater than 0.
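
The entry layout described above can be approximated with Python's
standard zipfile module. Treat this as a sketch under assumptions, not a
reader or writer for real XMLZip files: the base name "doc", the
tab-separated index format, and both function names are hypothetical,
since the actual index syntax is not spelled out here.

```python
import io
import zipfile

def write_archive(buf, base, prefix_xml, fragments):
    """fragments: list of (xmlzip_id, xml_text) pairs in document order."""
    index_lines = []
    with zipfile.ZipFile(buf, "w") as zf:
        # First entry: the tree prefix, stored without compression.
        zf.writestr(base + ".0", prefix_xml, compress_type=zipfile.ZIP_STORED)
        # Middle entries: one deflated subtree fragment each, numbered from 1.
        for n, (frag_id, xml_text) in enumerate(fragments, start=1):
            entry = f"{base}.{n}"
            zf.writestr(entry, xml_text, compress_type=zipfile.ZIP_DEFLATED)
            index_lines.append(f"{entry}\t{frag_id}")
        # Last entry: the index (assumed format: "entry<TAB>id"), also stored.
        zf.writestr(base + ".i", "\n".join(index_lines),
                    compress_type=zipfile.ZIP_STORED)

def read_fragment(buf, base, frag_id):
    """Look up one placeholder id in the index and inflate only that entry."""
    with zipfile.ZipFile(buf) as zf:
        index = {}
        for line in zf.read(base + ".i").decode("utf-8").splitlines():
            entry, fid = line.split("\t")
            index[fid] = entry
        return zf.read(index[frag_id]).decode("utf-8")

buf = io.BytesIO()
write_archive(
    buf,
    "doc",
    '<catalog><xmlzip id="1"/><xmlzip id="2"/></catalog>',
    [("1", "<book><title>A</title></book>"),
     ("2", "<book><title>B</title></book>")],
)
second = read_fragment(buf, "doc", "2")
```

Reading fragment "2" touches only the index and one numbered entry; the
other fragment is never inflated.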
We have a customized version of XML4J (IBM's Java-based XML
Parser) that has integrated support for XMLZip - effectively shielding
the developer from the "unzip" process. Let me know if you are
interested and I'll shoot you a copy of it (totally unsupported,