How the Wayback Machine works

Jim Whitehead ejw@cse.ucsc.edu
Wed, 23 Jan 2002 10:03:51 -0800


D'oh -- just realized this was /.'ed today. It's still a good article. :-)

http://slashdot.org/articles/02/01/23/139240.shtml

- Jim

> -----Original Message-----
> From: fork-admin@xent.com [mailto:fork-admin@xent.com] On Behalf Of Jim Whitehead
> Sent: Wednesday, January 23, 2002 9:58 AM
> To: FoRK
> Subject: How the Wayback Machine works
>
>
> A link I picked up from Dave's Scripting News:
>
> http://www.oreillynet.com/pub/a/webservices/2002/01/18/brewster.html
>
> Some quotes:
>
> Brewster Kahle: In the Wayback Machine, currently there are 10 billion Web
> pages, collected over five years. That amounts to 100 terabytes, which is
> 100 million megabytes. So if a book is a megabyte, which is about what it
> is, and the Library of Congress has 20 million books, that's 20 terabytes.
> This is 100 terabytes. At that size, this is the largest database ever
> built. It's larger than Walmart's, American Express', the IRS. It's the
> largest database ever built. And it's receiving queries -- because every
> page request when people are surfing around is a query to this database --
> at the rate of 200 queries per second. It's a fairly fast database engine.
> And it's built on commodity PCs, so we can do this cost-effectively. It's
> just using clusters of Linux machines and FreeBSD machines.
>
> Koman: How many machines?
>
> Kahle: Three hundred, we may be up to 400 machines now. When we first came
> out, we didn't architect it for the load we wound up with, so we had to
> throw another 20 to 30 machines at serving the index.
>
> --------------
>
> We can buy 100 TBs with 250 CPUs to work on it, all on a high-speed switch
> with redundancy built in. Something has changed by using these modern
> constructs that are heavily used at Google, Hotmail, here, Transmeta.
> There's a whole sector of companies that are more cost-constrained than say,
> banks, that just buy Oracle and Sun and EMC.
>
> --------------
>
> So if all books are 20 TBs, and 20 TBs are $80,000, that's the Library of
> Congress. Then something big has changed. All music? It's tiny. It looks
> like there're only one million records that have been produced over the last
> century. That's tiny. All movies? All theatrical releases have been
> estimated at 100,000, and most of those from India. If you take all the rest
> of ephemeral films, that's on the order of a couple hundred thousand. It's
> just not that big. It allows you to start thinking about the whole thing.
>
> ---------------
>
> - Jim
>
>
>
> http://xent.com/mailman/listinfo/fork
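
The storage arithmetic in the quoted interview is easy to check by hand, but here is a rough sketch in Python for anyone who wants to play with the numbers. It assumes decimal units (1 TB = 1,000,000 MB, which is what "100 terabytes, which is 100 million megabytes" implies) and one megabyte per book, as Kahle does. The function names are made up for illustration, and the per-terabyte cost and queries-per-day figures are my own extrapolations from the quoted numbers, not claims from the article.

```python
# Back-of-envelope check of the figures quoted from the interview.
# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def tb_to_mb(terabytes):
    """Convert terabytes to megabytes (decimal units)."""
    return terabytes * MB_PER_TB


def books_to_tb(book_count, mb_per_book=1):
    """Estimate storage in TB for a collection, at mb_per_book MB each."""
    return book_count * mb_per_book / MB_PER_TB


archive_tb = 100                          # Wayback Machine size, as quoted
library_tb = books_to_tb(20_000_000)      # Library of Congress, 20 million books
ratio = archive_tb / library_tb           # archive size relative to the library

# Extrapolations (not from the article): implied price per TB if 20 TB
# costs $80,000, and daily query volume if 200/s were sustained all day.
cost_per_tb = 80_000 / library_tb
queries_per_day = 200 * 60 * 60 * 24

print("Archive in MB:      ", tb_to_mb(archive_tb))
print("Library in TB:      ", library_tb)
print("Archive / library:  ", ratio)
print("Implied $ per TB:   ", cost_per_tb)
print("Queries per day:    ", queries_per_day)
```

On those assumptions the archive works out to five times the Library of Congress estimate, which matches the "20 terabytes" versus "100 terabytes" comparison in the quote.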