Re: [Stupid Idea Series] WWW/2 Take 2

Robert S. Thau (rst@ai.mit.edu)
Mon, 1 Mar 1999 21:28:34 -0500 (EST)


Kragen Sitaker writes:
> Re: automatically determining what pages are actually versions of each
> other: I put together a little piece of software that heuristically
> does exactly that, a couple of weeks ago. It seemed to do a reasonable
> job under some limited circumstances. I haven't tested it on many data
> sets, though, and it's currently O(N^2) in the number of documents.

Your algorithm works, essentially, by mapping each document to a point
in a high-dimensional space, and looking for nearby points.

There is an extensive literature on clustering algorithms, and on
data structures for speeding them up. I'm not current enough to
recommend any recent survey paper, but you could easily make the thing
asymptotically O(n log n) by building tree structures in document
space, and searching tree nodes near a document's own for near
neighbors...

rst