[FoRK] Google: 'At scale, everything breaks'

Eugen Leitl eugen at leitl.org
Thu Jun 23 03:10:11 PDT 2011


This story was printed from ZDNet UK, located at http://www.zdnet.co.uk/

Story URL:

Google: 'At scale, everything breaks'

By Jack Clark, 22 June 2011 16:00


Google operates technology that is expected to be reliable in the face of
major traffic demands.

To scale its services, the company has developed many systems, such as
MapReduce and Google File System, whose published designs have since been
reimplemented as open source, with heavy backing from Yahoo, in the popular
Hadoop data-analytics framework.

However, behind the scenes, the company is fighting a constant battle against
the twin demons of cascading failovers and the increasingly challenging
levels of complexity that massively scaled services bring.

Urs Hölzle was Google's first vice president of engineering. Before joining
Google he worked on high-performance implementations of object-orientated
languages, contributed to Darpa's national compiler infrastructure project,
and developed compilers for Smalltalk and Java.

According to Hölzle, "at scale, everything breaks", and Google must walk a
tightrope between increasing the scale of its systems and avoiding
cascading failovers, such as the outage that affected Gmail in March this

Q: Apart from focusing on physical infrastructure, such as datacentres, are
there efficiencies that Google gains from running software at massive scale?

A: I think there absolutely is a very large benefit there, probably more so
than you can get from the physical efficiency. It's because when you have an
on-premise server it's almost impossible to size the server to the load,
because most servers are actually too powerful and most companies [using
them] are relatively small.

[But] if you have a large-scale email service where millions of accounts are
in one place, it's much easier to size the pool of servers to that load. If
you aggregate the load, it's intrinsically much easier to keep your servers
well utilised.
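
The sizing argument above can be put into rough numbers. The sketch below is illustrative only: the figures (1,000 companies, 50 mailboxes each, 1,000 mailboxes per server) are assumptions chosen for the example, not Google data, and it ignores peak load and spare capacity.

```python
import math

def servers_needed(load, capacity):
    """Whole servers required to carry `load`; never fewer than one."""
    return max(1, math.ceil(load / capacity))

def utilisation(loads, capacity, pooled):
    """Fraction of purchased capacity that is actually used."""
    total = sum(loads)
    if pooled:
        # One shared pool sized to the aggregate load.
        servers = servers_needed(total, capacity)
    else:
        # Every company buys its own server(s), however small its load.
        servers = sum(servers_needed(load, capacity) for load in loads)
    return total / (servers * capacity)

# Hypothetical figures: 1,000 small companies with 50 mailboxes each,
# and a server that can host 1,000 mailboxes.
loads = [50] * 1000
capacity = 1000

separate = utilisation(loads, capacity, pooled=False)
shared = utilisation(loads, capacity, pooled=True)
print(f"on-premise utilisation: {separate:.0%}")
print(f"pooled utilisation:     {shared:.0%}")
```

With these assumed numbers each on-premise server sits almost idle, while the pooled service needs only as many servers as the total load requires.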

Q: What are Google's plans for the evolution of its internal software tools?

A: There's obviously an evolution. For example, most applications don't use
[Google File System (GFS)] today. In fact, we're phasing out GFS in favour of
the next-generation file system that is very similar, but it's not GFS
anymore. It scales better and has better latency properties as well. I think
three years from now we'll try to retire that because flash memory is coming
and faster networks and faster CPUs are on the way and that will change how
we want to do things.

One of the nice things is that if everyone today is using the Bigtable
compressed database, suppose we have a better Bigtable down the line that
does the right thing with flash — then it's relatively easy to migrate all
these applications as long as the API stays stable.
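
The point about a stable API can be sketched as follows. This is not Bigtable's real interface: `TableStore`, `DiskBackedTable`, `FlashBackedTable` and the helper functions are hypothetical names, and both back ends are plain in-memory dictionaries standing in for real storage. The sketch only shows that application code written against the interface does not change when the back end does.

```python
from typing import Dict, Optional, Protocol, Tuple

class TableStore(Protocol):
    """The stable API that applications code against (hypothetical)."""
    def put(self, row: str, column: str, value: bytes) -> None: ...
    def get(self, row: str, column: str) -> Optional[bytes]: ...

class DiskBackedTable:
    """Stand-in for today's store: one flat map keyed by (row, column)."""
    def __init__(self) -> None:
        self._cells: Dict[Tuple[str, str], bytes] = {}

    def put(self, row: str, column: str, value: bytes) -> None:
        self._cells[(row, column)] = value

    def get(self, row: str, column: str) -> Optional[bytes]:
        return self._cells.get((row, column))

class FlashBackedTable:
    """Stand-in for a later store with a different internal layout."""
    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, bytes]] = {}

    def put(self, row: str, column: str, value: bytes) -> None:
        self._rows.setdefault(row, {})[column] = value

    def get(self, row: str, column: str) -> Optional[bytes]:
        return self._rows.get(row, {}).get(column)

# Application code depends only on TableStore, not on either back end.
def record_login(store: TableStore, user: str, when: bytes) -> None:
    store.put(user, "last_login", when)

def last_login(store: TableStore, user: str) -> Optional[bytes]:
    return store.get(user, "last_login")

results = []
for backend in (DiskBackedTable(), FlashBackedTable()):
    record_login(backend, "alice", b"2011-06-22")
    results.append(last_login(backend, "alice"))
print(results)
```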

Q: How significant is it that these back-end systems, such as MapReduce and
the Google File System, spawn open-source applications such as Hadoop
through publication and adaptation by other companies?

A: It's an unavoidable trend in the sense that [open source] started with the
operating system, which was the lowest level that everyone needed. But the
power of open source is that you can continue to build on the infrastructure
that already exists [and you get] things like Apache for the web server. Now
we're getting into a broader range of services that are available through the

For instance, cluster management itself or some open-source version will
happen, because everyone needs it as their computation scales and their issue
becomes not the management of a single machine, but the management of a whole
bunch of them. Average IT shops will have hundreds of virtual machines (VMs)
or hundreds of machines they need to manage, so a lot of their work is about
cluster management and not about the management of individual VMs.

Often, if computation is cheap enough, then it doesn't pay to do your own
solution. Maybe you can do your own solution but you [might not be able to]
justify the software engineering effort and then the ongoing maintenance,
instead of staying within an open-source system.

Q: In your role, what is the most captivating technical problem you deal with?

A: I think the big challenges haven't changed that much. I'd say that it's
dealing with failure, because at scale everything breaks no matter what you
do and you have to deal reasonably cleanly with that and try to hide it from
the people actually using your system.

There are two big reasons why MapReduce-Hadoop is really popular. One is that
it gets rid of your parallelisation problem because it parallelises
automatically and does load-balancing automatically across machines. But the
second one is if you have a large computation, it deals with failures. So if
one of [your] machines dies in the middle of a 10-hour computation, then
you're fine. It just happens.
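
A toy illustration of the failure handling described above: because each map task is a pure function of its input chunk, a task whose worker dies can simply be run again. This is a single-process sketch, not how MapReduce or Hadoop is implemented; `WorkerDied`, `run_with_retry` and the rule that one chunk fails on its first attempt are all invented for the example.

```python
from collections import Counter

class WorkerDied(Exception):
    """Simulated loss of the machine running a task."""

def count_words(chunk, attempt):
    # Hypothetical failure rule: the chunk containing "crash" loses its
    # worker on the first attempt and succeeds when re-run.
    if "crash" in chunk and attempt == 1:
        raise WorkerDied(chunk)
    return Counter(chunk.split())

def run_with_retry(task, chunk, max_attempts=3):
    """Re-run a task from scratch until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk, attempt)
        except WorkerDied:
            continue  # the input is unchanged, so re-running is safe
    raise RuntimeError(f"task failed {max_attempts} times: {chunk!r}")

def map_reduce(chunks):
    """Map each chunk to word counts, then reduce by summing the counts."""
    total = Counter()
    for chunk in chunks:
        total += run_with_retry(count_words, chunk)
    return total

result = map_reduce(["a b a", "crash b", "a"])
print(result)
```

The caller never sees the failed attempt; it only receives the final counts.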

I think the second one is dealing with stateful, mutable states. MapReduce is
easy because it's a case of presenting it with a number of files and having
it compute them and, if things go wrong, you can just do it again. But Gmail,
IM and other stateful services have very different security [uptime and
data-loss] implications.

We use tapes, still, in this age because they're actually a very
cost-effective way as a last resort for Gmail. The reason why we put it in is
not physical data loss, but once in a blue moon you will have a bug that
destroys all copies of the online data and your only protection is to have
something that is not connected to the same software system, so you can go
and redo it.

The last challenge we're seeing is to use commodity hardware but actually
make it work in the face of rapid innovation cycles. For example, here's a
new HDD (hard-disk drive). There's a lot of pressure in the market to get it
out because you want to be the first one with a 3TB drive and there's a lot
of cost pressure too, but how do you actually make these drives reliable?

As a large-scale user, we see all the corner cases and in virtually every
piece of hardware we use we find bugs, even if it's a shipping piece of

If you use the same operating system, like Linux, and run the same
computation on 10,000 machines and every day 100 of them fail, you're going
to say, wow this is wrong. But if you did it by yourself, it's a one-percent
failure rate. So three times a year you'd have to change your server. You
probably wouldn't take the effort to debug and you'd think it was a random
fluke or you'd debug and it wouldn't be happening any more.
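
The arithmetic behind that comparison, using the figures quoted in the answer (10,000 machines, 100 failures a day), works out as follows. Treating failures as evenly spread across machines and days is a simplifying assumption.

```python
machines = 10_000
failures_per_day = 100

# Chance that any one machine fails on a given day.
daily_rate = failures_per_day / machines

# What the owner of a single machine would see.
failures_per_machine_per_year = daily_rate * 365
days_between_failures = 1 / daily_rate

print(f"daily failure rate: {daily_rate:.0%}")
print(f"failures per machine per year: {failures_per_machine_per_year:.2f}")
print(f"average days between failures: {days_between_failures:.0f}")
```

At fleet level the problem shows up 100 times every day; on a single machine it shows up only a few times a year, which is easy to dismiss as a fluke.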

Q: It seems you want all your services to speak to each other. But surely this
introduces its own problems of complexity?

A: Automation is key, but it's also dangerous. You can shut down all machines
automatically if you have a bug.
It's one of the things that is very challenging to do because you want
uniformity and automation, but at the same time you can't really automate
everything without lots of safeguards or you get into cascading failures.
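
One kind of safeguard is a cap on how much of the fleet a single automated action may touch. The sketch below is an assumption about what such a guard could look like, not a description of Google's tooling; `guarded_shutdown`, `SafeguardTripped` and the 5 percent limit are made up for illustration.

```python
class SafeguardTripped(Exception):
    """Raised when an automated action is too broad to run unattended."""

def guarded_shutdown(fleet, targets, max_percent=5):
    """Return the machines to shut down, refusing oversized requests.

    A buggy selector that picks most of the fleet is stopped here
    instead of being executed automatically.
    """
    selected = set(targets) & set(fleet)
    limit = len(fleet) * max_percent // 100
    if len(selected) > limit:
        raise SafeguardTripped(
            f"refusing to shut down {len(selected)} of {len(fleet)} "
            f"machines (limit {limit}); needs human review"
        )
    return sorted(selected)

fleet = [f"m{i}" for i in range(100)]

# A normal request touching two machines goes through.
ok = guarded_shutdown(fleet, ["m1", "m2"])

# A buggy request selecting the whole fleet is blocked.
try:
    guarded_shutdown(fleet, fleet)
    blocked = False
except SafeguardTripped:
    blocked = True

print(ok, blocked)
```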

Complexity is evil in the grand scheme of things because it makes it possible
for these bugs to lurk that you see only once every two or three years, but
when you see them it's a big story because it had a large, cascading effect.

Keeping things simple and yet scalable is actually the biggest challenge.
It's really, really hard. Most things don't work that well at scale, so you
need to introduce some complexity, but you have to keep it down.

Q: Have you looked into some of the emerging hardware, such as PCIe-linked
flash?

A: We're not a priori excluding anything and we're playing with things
all the time. I would expect PCIe flash to become a commodity because it's a
pretty good way of exposing flash to your operating system. But flash is
still tricky because the durability is not very good.

I think these all have the promise of deeply affecting how applications are
written because if you can afford to put most of your data in a storage
medium like this rather than on a moving head, then a factor of a thousand
makes a huge difference. But these things are not close enough yet to disk in
terms of storage cost.
