[WWW10 Highlights] Compressed and binary XML

John Evdemon john.evdemon@xmls.com
Wed, 23 May 2001 17:22:02 -0400


On 22 May 2001, at 15:49, Jim Whitehead wrote:

> This kind of speedup is only possible with an XML compression 
> that understands XML, as opposed to something generic like zip.

Hello all,

At the risk of sounding like a sales guy (which I am proud to say I 
am not), we have a freeware product called XMLZip that might 
interest you all.  XMLZip is a free Java application designed to 
compress XML data.  You can download a copy at: 
http://www.xmls.com/resources/xmlzip.xml?id=resources_xmlzip

Compressing the entire XML file lowers storage and processing 
requirements, but it also renders the DOM API inaccessible (since 
the file must be decompressed prior to being parsed).  XMLZip 
reduces the size of XML files while retaining the accessibility of the 
DOM API (since you only decompress the portions of the file you 
wish to parse).  XMLZip also supports selective compression 
and decompression of files, letting users choose at compression 
time the DOM level at which subtrees are compressed.

XMLZip files are constructed based upon a level parameter. The 
level parameter specifies how many nested element levels down from 
the document root to go before compressing subtrees. For 
example, a level 2 XMLZip file has each child of the document root 
replaced with a reference to the compressed subtree (I have 
attached an uncompressed XML file and the associated XMLZip file 
as an example).
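
To make the level parameter concrete, here is a rough Python sketch 
of the splitting step. It is not XMLZip's actual code: the function 
name split_at_level and the attribute name "id" are assumptions made 
for illustration, and the document root is counted as level 1.

```python
import copy
import xml.etree.ElementTree as ET

# Placeholder tag name follows the <xmlzip> tag mentioned in this post;
# the lowercase "id" attribute name is an assumption.
PLACEHOLDER = "xmlzip"


def split_at_level(root, level):
    """Return (prefix, subtrees) for a hypothetical level-based split.

    Every element at depth `level` (document root = level 1) is replaced
    in a copy of the tree by an empty placeholder tag; the detached
    subtrees are returned in a dict keyed by placeholder id.
    """
    if level < 2:
        raise ValueError("level must be >= 2; level 1 would replace the root")
    prefix = copy.deepcopy(root)  # leave the caller's tree untouched
    subtrees = {}
    next_id = 1

    def walk(parent, depth):
        nonlocal next_id
        # Iterate over a snapshot because children are swapped in place.
        for position, child in enumerate(list(parent)):
            if depth + 1 == level:
                stub = ET.Element(PLACEHOLDER, {"id": str(next_id)})
                stub.tail = child.tail  # keep whitespace between siblings
                child.tail = None
                parent.remove(child)
                parent.insert(position, stub)
                subtrees[next_id] = child
                next_id += 1
            else:
                walk(child, depth + 1)

    walk(prefix, 1)
    return prefix, subtrees
```

With level 2, each child of the root becomes one placeholder; with 
level 3, the root's children stay in the prefix and their children 
are split out instead.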

An XMLZip file is a ZIP format file. The first entry in the file is the 
XML document tree prefix. The tree prefix is the original XML file 
with selected subtrees each replaced by a single, empty <xmlzip> 
tag. The subtrees that are removed are all the sibling subtrees at a 
particular level. The first entry is denoted in the XMLZip file as a ZIP 
entry with a name ending in ".0".

The last entry in the XMLZip file is the index. It lists all the 
compressed XMLZip file entries and the ID attribute of the 
corresponding <xmlzip> tag. The index entry in the XMLZip file has 
a name ending in ".i".

Both the tree prefix (the .0 entry) and the index (the .i entry) are 
stored uncompressed for quick extraction.

The remaining entries in the XMLZip file are the compressed 
subtree fragments. Each entry has a name that ends in a dot 
followed by a number greater than 0.
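
The entry layout above can be approximated with Python's standard 
zipfile module. This is only a sketch under stated assumptions: the 
shared base name, the tab-separated index format, and the function 
names write_xmlzip_like and read_fragment are all made up here, and 
the real XMLZip index format may well differ.

```python
import io
import zipfile


def write_xmlzip_like(base, prefix_xml, fragments):
    """Build an archive with the entry order described above.

    base       -- hypothetical base name shared by all entries
    prefix_xml -- bytes of the tree prefix
    fragments  -- dict mapping placeholder id (int >= 1) to subtree bytes
    """
    buf = io.BytesIO()
    index_lines = []
    with zipfile.ZipFile(buf, "w") as zf:
        # First entry: tree prefix, stored uncompressed, name ends in ".0".
        zf.writestr(f"{base}.0", prefix_xml, compress_type=zipfile.ZIP_STORED)
        # Middle entries: deflated subtree fragments, names end in ".<n>".
        for frag_id in sorted(fragments):
            name = f"{base}.{frag_id}"
            zf.writestr(name, fragments[frag_id],
                        compress_type=zipfile.ZIP_DEFLATED)
            index_lines.append(f"{frag_id}\t{name}")  # assumed index format
        # Last entry: index, stored uncompressed, name ends in ".i".
        zf.writestr(f"{base}.i", "\n".join(index_lines),
                    compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


def read_fragment(archive_bytes, frag_id):
    """Decompress a single fragment, leaving the other entries alone."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        index_name = zf.namelist()[-1]  # the index is the last entry
        text = zf.read(index_name).decode("utf-8")
        index = dict(line.split("\t") for line in text.splitlines())
        return zf.read(index[str(frag_id)])
```

Because ZIP entries are compressed independently, read_fragment only 
inflates the one subtree that was asked for, which is the property 
that keeps partial DOM access cheap.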

We have a customized version of XML4J (IBM's Java-based XML 
Parser) that has integrated support for XMLZip - effectively shielding 
the developer from the "unzip" process.  Let me know if you are 
interested and I'll shoot you a copy of it (totally unsupported, 
natch!).

John Evdemon
CTO
XMLSolutions
www.xmls.com
www.vitria.com