Re: [Stupid Idea Series] WWW/2 Take 2

Robert S. Thau (rst@ai.mit.edu)
Mon, 1 Mar 1999 21:28:34 -0500 (EST)

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Gregory Alan Bolcer: "Killer Appliances"
Previous message: Robert S. Thau: "Re: Windows 2001 Sneak Peek"
In reply to: Kragen Sitaker: "Re: [Stupid Idea Series] WWW/2 Take 2"
Next in thread: Tom Arkwright: "Re: [Stupid Idea Series] WWW/2 Take 2"

Kragen Sitaker writes:
> Re: automatically determining what pages are actually versions of each
> other: I put together a little piece of software that heuristically
> does exactly that, a couple of weeks ago. It seemed to do a reasonable
> job under some limited circumstances. I haven't tested it on many data
> sets, though, and it's currently O(N^2) in the number of documents.

Your algorithm works, essentially, by mapping each document to a point
in a high-dimensional space, and looking for nearby points.

There is an extensive literature on clustering algorithms, and on
data structures for speeding them up. I'm not current enough to
recommend any recent survey paper, but you could easily make the thing
asymptotically O(n log n) by building tree structures in document
space, and searching tree nodes near a document's own for near
neighbors...

rst

Next message: Gregory Alan Bolcer: "Killer Appliances"
Previous message: Robert S. Thau: "Re: Windows 2001 Sneak Peek"
In reply to: Kragen Sitaker: "Re: [Stupid Idea Series] WWW/2 Take 2"
Next in thread: Tom Arkwright: "Re: [Stupid Idea Series] WWW/2 Take 2"