Project Gutenberg markup

From: Kragen Sitaker
Date: Mon Mar 13 2000 - 05:32:54 PST

Dan Kohn writes:
> As for XML vs. ASCII, I think Michael Hart has done a disservice to the open
> source ebook community by insisting solely on ASCII. Specifically, I'm
> concerned that his use of ALL CAPS (and lack of any authoring guidelines
> regarding encoding structure) eliminates the simplest HTML distinction
> between Emphasis and Strong, let alone the reasonable application of
> footnotes.

It sure would be nice to have documents in a better-than-ASCII format.
Something like setext would be just great
--- looks fine as ASCII, yet its conventions carry more information.

We'd have to come up with a setext variant suitable for novels and
other books, though, and I'm not convinced the same markup can
reasonably be used for texts as different as the King James Bible,
Shakespeare's Antony and Cleopatra, The Adventures of Sherlock
Holmes, and Webster's Dictionary, all of which are in PG today.

Whatever you may think of Mr. Hart's particular technology choices,
Project Gutenberg in its present form is much, much better than
nothing. So I am reluctant to criticize him as having done a
"disservice" to anybody by not doing it perfectly.

I also suspect that marking up an existing etext (particularly with
Perl to help you) is much less effort than typing it in.

> Yes, it was hardly clear in 1971 that XML would become the One True Way(tm),
> but INTIME and GML were available and (with a simple DTD) would have been
> perfect. Requiring the latest beta of a web browser is a complete red
> herring, because all files could have (and now will be) made available as
> ASCII as well. It's just that the source would have included structure, for
> those who wanted it. I think it's unreasonable to put a generic markup
> system in the same category as an OS or program, given that the former can
> be programmatically reduced to plain vanilla ASCII in seconds.

The advantage of markup was probably not apparent in 1971, when
graphical displays were toys of Ivan Sutherland and Xerox PARC; when
high-resolution printers and typesetters were an incredible luxury;
when the laser printer had not yet been invented; and when Moore's Law
was not yet a commonplace.

It should be apparent today: markup makes it possible to write tools
that navigate a document intelligently, render it decently on printed
output, and, for some documents, do other interesting things.
(Finding a particular verse in the Bible from a reference like
"1 Kings 5:7", for example.)
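
A minimal sketch of that kind of lookup, in Python rather than Perl.
It assumes the markup has already been turned into an index keyed by
(book, chapter, verse); the VERSES table, its placeholder entries, and
both function names are hypothetical, not anything PG provides.

```python
import re

# Hypothetical index: (book, chapter, verse) -> verse text.
# A real one would be built from the marked-up etext, not typed by hand.
VERSES = {
    ("1 Kings", 5, 7): "(text of 1 Kings 5:7)",
    ("John", 3, 16): "(text of John 3:16)",
}

# Book name (which may itself start with a digit), then chapter:verse.
REFERENCE = re.compile(r"^\s*(.+?)\s+(\d+):(\d+)\s*$")

def parse_reference(ref):
    """Split a reference like '1 Kings 5:7' into (book, chapter, verse)."""
    m = REFERENCE.match(ref)
    if m is None:
        raise ValueError("not a verse reference: %r" % ref)
    return m.group(1), int(m.group(2)), int(m.group(3))

def find_verse(ref, verses=VERSES):
    """Return the verse text for a reference, or None if it is not indexed."""
    return verses.get(parse_reference(ref))
```

The lookup itself is trivial; the hard part is having the chapter and
verse boundaries recorded in the text so the index can be built.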

For a while, I've wanted to run a kiosk-sized bookstore; anyone could
come in and browse books on the screen, and for $5 or so, they could
have them printed out and bound. (Printing and binding should take
about ten minutes with a $5,000 equipment budget and an experienced
staff member.) You could reasonably run a bookstore with 50,000 titles
in a 40-square-foot space --- once those titles were in digital form
available to you!

> Kragen could probably write a Perl program to do it in under 10 minutes (and
> yes, Perl wasn't around in 1971 either, but Cobol was). It's always easy to
> remove information (editing out tags), but (until we have strong AI, see
> above) it's impossible to programmatically (and reliably!) add structure
> into documents that are lacking it.

It would probably take me 15 minutes with Perl. I don't think anyone
could do it in 15 minutes with COBOL; perhaps two days.
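
For a sense of scale, here is a rough sketch of such a stripper,
written in Python rather than Perl. It assumes simple, well-formed
tags and a function name (strip_markup) made up for this example; it
makes no attempt to handle comments, CDATA sections, or a ">" inside
an attribute value.

```python
import re
from html import unescape

# Anything between "<" and the next ">" is treated as a tag.
TAG = re.compile(r"<[^>]+>")

def strip_markup(text):
    """Drop anything that looks like a tag, then decode character entities."""
    return unescape(TAG.sub("", text))
```

For example, strip_markup("<p>Call me <em>Ishmael</em>.</p>") gives
back the bare sentence, with the emphasis lost.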

However, it would not take Michael Hart only two days, with COBOL or
Perl or with any other language. It is understandable that he would
not want to organize such a project around technology that would have
to be purpose-built for it and which he was not able to maintain.

It is natural for programmers and folks who work with programmers to
think of organizing a project like PG around an edifice like a markup
system. It is probably not so natural for folks who don't.

Nevertheless, something like setext would help quite a bit; no
stripping program would be needed. As long as incoming etexts were
vetted by a structure-validation program, we could be confident that
the structure was there, but like a ghost, it would disappear if you
looked right at it.
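
A structure-validation program of that sort could start out very
small. The sketch below, in Python, checks one assumed convention
only: a heading is a line of text followed by a line of "=" or "-"
characters of exactly the same length. That rule and the name
check_structure are illustrative assumptions, not a statement of what
setext actually requires.

```python
import re

# A line consisting only of "=" characters or only of "-" characters.
UNDERLINE = re.compile(r"^(=+|-+)$")

def check_structure(text):
    """Return a list of (line_number, message) problems; empty means OK."""
    problems = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        underline = line.rstrip()
        if not UNDERLINE.match(underline):
            continue
        # The heading is whatever sits on the line just above the underline.
        title = lines[i - 1].rstrip() if i > 0 else ""
        if not title:
            problems.append((i + 1, "underline with no heading above it"))
        elif len(title) != len(underline):
            problems.append((i + 1, "underline length does not match heading"))
    return problems
```

An etext that passes is still perfectly readable as plain ASCII, which
is the whole point.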

> Thus, I presume we would do the data entry as XML and submit both XML and
> ASCII versions simultaneously.

Generating ASCII versions would be best done by PG itself, not the

<>       Kragen Sitaker     <>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
The power didn't go out on 2000-01-01 either.  :)

This archive was generated by hypermail 2b29 : Mon Mar 13 2000 - 05:33:27 PST