Re: Real-time distributed Web search (Gnutella knockoff?)

From: Adam L. Beberg (beberg@mithral.com)
Date: Fri Jun 30 2000 - 18:42:37 PDT


On Fri, 30 Jun 2000, Kragen Sitaker wrote:

> Well, you definitely have more experience with building large-scale
> distributed systems than I do. :) More, actually, than almost anyone
> does. Could you take the time to explain, in short words so we can
> understand, how familiar rules of hierarchy and specialization don't
> apply?

I'll take that as a compliment, I think. Hmmm, I can't really spare the
time, but I'll do it anyway, tho I'll try not to spoil my TWIST topics
for all the FoRKies that will be there. Won't use small words tho ;)

Hierarchy and specialization are how humans cope with the universe, so
that's how they design things. You report to your boss, and go to a
doctor when you're about to die.

In a hierarchy with N systems you basically have N-1 connections going,
but since they are localized the load on the "big net" is more on the
order of sqrt(N). Now, the web is not quite a hierarchy, but in the old
days of proxy-caches, before Akamai and friends broke them, before
dynamic content was everywhere it didn't belong, it was reasonably
close. This of course is the bandwidth-conserving, efficient, fast zoom
zoom case. The internet itself is organized this way - as are all
infrastructure grids - backbone, regional hubs, ISPs, you. M-bone used
to work this way too, as a dynamic hierarchy, which could have handled
that Victoria's Secret show without even a netblip.
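
[Editor's sketch] The N-1 and sqrt(N) figures above can be checked with
a toy model. This is only an illustration under stated assumptions: a
balanced tree whose fan-out is chosen as sqrt(N), with the links that
terminate at the root standing in for the "big net". The function names
are hypothetical and not from Cosm or any real system.

```python
import math

def build_tree(n, fanout):
    # Node 0 is the root; node i (i >= 1) hangs off node (i - 1) // fanout.
    # Assumed layout for illustration only.
    return [None] + [(i - 1) // fanout for i in range(1, n)]

def total_links(parents):
    # Every node except the root has exactly one uplink, so this is N - 1.
    return sum(1 for p in parents if p is not None)

def backbone_links(parents):
    # Links that terminate at the root, standing in for the "big net".
    return sum(1 for p in parents if p == 0)

if __name__ == "__main__":
    n = 10_000
    parents = build_tree(n, fanout=math.isqrt(n))
    # Prints: 9999 100
    print(total_links(parents), backbone_links(parents))
```

With 10,000 nodes the tree carries 9,999 links in total, but only 100
of them touch the root; the rest stay local to their own branch.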

In a specialized system like Google, you have a central specialized
system handling all N connections with low load per node. It's a
bandwidth hog to a point, but it's still very efficient since you don't
do any caching anyway. This is where the web is now, since every page I
visit refreshes every time I move the mouse. This is probably where
"real world" things will stay basically forever. It's not perfect or
even decent, but it is optimal for the advertisers.

And then there are the so-called "distributed" (used in its buzzword
form) systems like Gnutella and Freenet, which are actually _broadcast_
systems, where N nodes give you more on the order of N! connections.
Fine for small N, but quickly exploding to a molten mess of bits. This
is probably best known as the "well it worked in the lab" case, but
usually by more obscene names. USENET works this way: everything gets
copied 50,000+ times whether anyone wants it or not. You really only
need to do this if you're pretending to hide from people to break laws,
otherwise this method is just too stupid and wasteful. Hopefully
datahavens will allow people to stop thinking that this is a good idea
at all.
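
[Editor's sketch] A rough way to see the waste in a broadcast design is
to count the messages one flooded query generates. The model below is
an assumption, not Gnutella's or Freenet's actual protocol: every node
forwards a query the first time it sees it, to every neighbour except
the one it arrived from, on a hypothetical ring-lattice topology. It
makes no attempt to reproduce the growth rate quoted above, only the
duplicate traffic.

```python
from collections import deque

def ring_lattice(n, k):
    # Hypothetical topology: node i links to its k nearest neighbours on
    # each side of a ring (requires n > 2 * k so neighbours are distinct).
    return {i: {(i + d) % n for d in range(-k, k + 1) if d != 0}
            for i in range(n)}

def flood(graph, origin):
    # Each node forwards the query once, on first sight, to every
    # neighbour except the sender. Returns (messages sent, nodes reached).
    seen = {origin}
    queue = deque([(origin, None)])
    messages = 0
    while queue:
        node, came_from = queue.popleft()
        for peer in graph[node]:
            if peer == came_from:
                continue
            messages += 1
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, node))
    return messages, len(seen)

if __name__ == "__main__":
    for k in (2, 5):
        print(k, flood(ring_lattice(100, k), origin=0))
```

For 100 nodes with four neighbours each (k=2), this counts 301 messages
to reach all 100 nodes, where a tree would need 99. Raising the degree
to ten (k=5) pushes it to 901 for the same single query.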

Guess which category a "distributed search engine" is in :)

Now in a true distributed system (non-buzzword), everything is dynamic,
and general purpose. The ideas of "here" and "there" no longer even
apply, as the ideas of "me" and "not-relevant" emerge. You basically
have network goo, with no rules, no hierarchy, no specialization.

The system has to be intelligent enough to form internal hierarchies,
specializations, and other global optimizations on its own, and on the
fly, and do it all well enough that the thing doesn't melt. Since
people can't even teach stoplights to coordinate traffic intelligently
(take the bus!), distributed systems are still considered somewhat
tricky to do right.

Almost everything in Cosm is in the first category, but I'll eventually
find a way to fix the stuff still in the second too, if it's possible
before I have to give up and get a day job.

As a related mini-bit, even Oracle only kinda-sorta-cross-your-fingers
has distributed databases working after trying for decades. It's
non-trivial, if even possible at all. And since a distributed search
engine is really just a distributed database...

- Adam L. Beberg
  Mithral Communications & Design, Inc.
  The Cosm Project - http://cosm.mithral.com/
  beberg@mithral.com - http://www.iit.edu/~beberg/


This archive was generated by hypermail 2b29 : Fri Jun 30 2000 - 18:46:39 PDT