How the Wayback Machine works

Jim Whitehead
Wed, 23 Jan 2002 09:57:40 -0800

A link I picked up from Dave's Scripting News:

Some quotes:

Brewster Kahle: In the Wayback Machine, currently there are 10 billion Web
pages, collected over five years. That amounts to 100 terabytes, which is
100 million megabytes. So if a book is a megabyte, which is about what it
is, and the Library of Congress has 20 million books, that's 20 terabytes.
This is 100 terabytes. At that size, this is the largest database ever
built. It's larger than Walmart's, American Express', the IRS. It's the
largest database ever built. And it's receiving queries -- because every
page request when people are surfing around is a query to this database --
at the rate of 200 queries per second. It's a fairly fast database engine.
And it's built on commodity PCs, so we can do this cost-effectively. It's
just using clusters of Linux machines and FreeBSD machines.
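
A quick sanity check of the arithmetic in that quote, as a minimal Python sketch. The input figures are Kahle's; the decimal unit convention (1 TB = 1,000,000 MB), the variable names, and the derived numbers (average page size, queries per day) are my own assumptions and back-of-the-envelope results, not something stated in the interview.

```python
# Figures quoted by Kahle; names and unit convention are assumptions.
MB_PER_TB = 1_000_000          # decimal units, matching "100 million megabytes"
SECONDS_PER_DAY = 86_400

archive_tb = 100               # size of the Wayback Machine archive
pages = 10_000_000_000         # 10 billion Web pages
books = 20_000_000             # Library of Congress estimate
mb_per_book = 1                # "if a book is a megabyte"
queries_per_second = 200

archive_mb = archive_tb * MB_PER_TB               # archive size in megabytes
library_tb = books * mb_per_book / MB_PER_TB      # Library of Congress in TB
ratio = archive_tb / library_tb                   # archive vs. library

# Derived, not quoted: average stored size per page, in kilobytes.
avg_page_kb = archive_mb * 1000 / pages

# Derived, not quoted: daily query volume at a sustained 200 queries/second.
queries_per_day = queries_per_second * SECONDS_PER_DAY

print(f"archive: {archive_mb:,} MB")
print(f"library: {library_tb} TB, archive is {ratio}x larger")
print(f"average page: {avg_page_kb} KB")
print(f"queries per day: {queries_per_day:,}")
```

So the quoted numbers hang together: the archive works out to five Libraries of Congress, at roughly 10 KB per stored page.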

Koman: How many machines?

Kahle: Three hundred, we may be up to 400 machines now. When we first came
out, we didn't architect it for the load we wound up with, so we had to
throw another 20 to 30 machines at serving the index.


We can buy 100 TBs with 250 CPUs to work on it, all on a high-speed switch
with redundancy built in. Something has changed by using these modern
constructs that are heavily used at Google, Hotmail, here, Transmeta.
There's a whole sector of companies that are more cost-constrained than say,
banks, that just buy Oracle and Sun and EMC.


So if all books are 20 TBs, and 20 TBs are $80,000, that's the Library of
Congress. Then something big has changed. All music? It's tiny. It looks
like there're only one million records that have been produced over the last
century. That's tiny. All movies? All theatrical releases have been
estimated at 100,000, and most of those from India. If you take all the rest
of ephemeral films, that's on the order of a couple hundred thousand. It's
just not that big. It allows you to start thinking about the whole thing.
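
The storage price in that last quote can be broken down the same way. This is a rough sketch using only the two quoted figures (20 TB for $80,000) plus the earlier one-megabyte-per-book estimate; the per-terabyte and per-book costs are derived by me and assume the price scales linearly, which the interview does not claim.

```python
# Quoted figures: 20 TB of storage for $80,000, 20 million books at ~1 MB each.
storage_tb = 20
storage_cost_usd = 80_000
books = 20_000_000

# Derived, assuming cost scales linearly with capacity (my assumption).
cost_per_tb = storage_cost_usd / storage_tb
cost_per_book = storage_cost_usd / books

print(f"cost per TB: ${cost_per_tb:,.0f}")
print(f"cost per book: ${cost_per_book:.3f}")
```

That is about $4,000 per terabyte, or less than half a cent of disk per book.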


- Jim