From: Gavin Thomas Nicol (firstname.lastname@example.org)
Date: Mon Sep 18 2000 - 06:49:16 PDT
> No, no, no. You want to present a document to the world. Taking your
> Boeing manual example, we have pages, TOC, list of figures, and the
> like. If the thing is too large to be indexed as a whole (certainly in
> your example), you can indeed fragment it into smaller pieces (say,
> pages), and index them independently. You can pull up every page with
> a unique query. That's your URI of the individual atomic
Right. The problem here is that the pages *may* be generated dynamically
(dynamic fragmentation), and the fragmentation itself is a function of
the runtime environment. For example, access via Opera, or a PDA, would
result in something vastly different from something accessed via Netscape
The point is that the list of URLs, and the various flavors of them,
is potentially gigantic. Even for a single targeted medium (IE5, for
example), there could be many millions of URLs.
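To make the scale concrete, here is a minimal sketch of how the URL count
grows when fragmentation depends on the runtime environment. The URL scheme,
the function names, and the idea of a "page size" per client are all
assumptions made for illustration; nothing here describes a real system.

```python
from itertools import product


def fragment_url(doc_id, client, page_size, page_no):
    """Hypothetical URL scheme: one URL per (client, page size, page)."""
    return f"/docs/{doc_id}?client={client}&size={page_size}&page={page_no}"


def count_fragment_urls(total_units, clients, page_sizes):
    """Count distinct fragment URLs without materializing any of them."""
    total = 0
    for _client, size in product(clients, page_sizes):
        # Ceiling division: number of pages needed at this page size.
        total += -(-total_units // size)
    return total
```

With a document of 1,000,000 units, three clients, and page sizes of 10, 50,
and 200, `count_fragment_urls` returns 375000, and that is before any other
environment-dependent parameter is taken into account.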
> Your page has some text, which will be full-text
> indexed. You leave the up-to-date full text index in a specific
> location, and notify the web crawler to pick up the index.
I don't get this. Often, the index size is about the size of the original
document (the larger the document, the smaller the ratio). I don't
see the benefit of making it available. What does the search engine index?
What links does it put in its database? Does it put a link to the
Also, even though the URLs are stable (as they are the result of
a function evaluation), my point is that computing the full set of
links is potentially tremendously expensive.
> instead of millions) and on the network (these indexes are darn
They can only be compact if they are not accurate.
> A typical URI does include the database query to reach your atomic
> document. I see no problems with associating a full text index of the
> document (plus META keywords for pictures, and the like) with that URI.
My point is that the document does not exist until asked for... hence
generating the document, the index, and the entire list of URLs may
not be feasible... even on a static document.
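A rough sketch of what "does not exist until asked for" means in practice,
assuming the document is a flat list of units and a fragment is just a slice
of it. The class and function names are hypothetical, chosen only for this
example.

```python
def render_fragment(units, page_size, page_no):
    """Compute one fragment purely from its parameters (stable result)."""
    start = page_no * page_size
    return units[start:start + page_size]


class LazyDocument:
    """Fragments are computed per request; nothing is generated up front."""

    def __init__(self, units):
        self.units = units
        self.rendered = 0  # how many fragments have actually been produced

    def get(self, page_size, page_no):
        self.rendered += 1
        return render_fragment(self.units, page_size, page_no)
```

The same request always yields the same fragment, so the URL is stable, but
only the fragments someone actually asks for are ever produced. An index
covering every possible fragment would have to call `get` for every
combination of parameters.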
This archive was generated by hypermail 2b29 : Mon Sep 18 2000 - 06:47:30 PDT