[FoRK] pixels to petaflops: how GPUs are pushing the HPC envelope

Eugen Leitl <eugen at leitl.org> on Fri Jun 22 09:27:50 PDT 2007

http://www.linux-mag.com/launchpad/business-class-hpc/main/3533

Pixels to PetaFLOPS: How GPUs are Pushing the HPC Envelope

Monday, June 18th, 2007
By Jeffrey B. Layton, Ph.D.

Everyone knows the clock speed increases in commodity CPUs have slowed down
to a crawl. There are several reasons for this, but the bottom line is that
we aren't likely to see increasing performance with each new processor as we
have in the past. At the same time Moore's Law is still holding, so the
number of transistors per processor is increasing. CPU companies have taken
the extra transistors and used them to create extra cores and cache, leading
us into the multi-core era of computing. Cores are great, but there are some
challenges facing many programmers as they convert to multi-core systems.

First, some applications are driven by single core performance. They can run
over multiple cores using threads, OpenMP, or MPI, but in the end the overall
code performance is driven by the single limiting process on a single core.
Second, the performance of the memory and I/O buses is not increasing as
rapidly as the number of cores. This situation means that the amount of
memory bandwidth per core is decreasing, reducing performance. Also, the
number of cores per network interface card (NIC) is rapidly increasing,
possibly reducing performance, or at the very least requiring higher
performance networks.

The challenges with multi-core CPUs in HPC can be overcome to a certain
degree, but people are still looking for more performance from their systems.
This quest has led people to look around for other commodity hardware that
can be used to increase performance. Enter the modern day Graphics Processing
Unit (GPU).

History Repeats Itself

Let's step back a few years and recall how clusters evolved. Clusters arose
because the traditional HPC systems were expensive and processing time was
limited -- HPC in the cathedral as it were. At the same time, PC components
were becoming faster. At NASA Goddard, Tom Sterling and Don Becker
demonstrated the Beowulf concept using commodity PC parts and open source
software. Their target cost was $50,000. Such a system would drastically
reduce the entry cost for a local HPC resource. About that time, commodity
parts (x86 hardware) were becoming price/performance competitive with the
fastest workstations. In addition, Fast Ethernet was making its entry into
the market. (Open source and the Internet came along as well, but that is
another story.) The first Beowulf was a huge success and forged a path for
true commodity-based HPC. This trend continues today, as HPC practitioners
continue to leverage high performance at commodity prices.

As a rule, HPC users are always looking to gain more performance. One of the
options is a hybrid scheme that adds a co-processor to the system. The
co-processor(s) are used for computations on the node. There are several
co-processor options being pursued, including FPGAs (Field Programmable
Gate Arrays) and specialized co-processors such as ClearSpeed. But these are
clearly not commodity-based solutions, and they come at a correspondingly
higher cost.
There is, however, a commodity co-processor component that has shown
continued growth in specialized performance -- modern day graphics cards.

Thanks in part to the PC gaming industry, today's graphics cards have very
high-performance GPUs (Graphic Processing Units) as computational engines.
These GPUs are programmable, so you can write code for them. In essence, you
are using your high-end graphics card not for slaying monsters and aliens,
but for computing.

Many people get excited when a CPU is released at a higher GHz. A bit more
performance, plug and play, life is great. GPUs on the other hand have been
quietly, at least to the HPC world, taking giant leaps and bounds in
performance. Figure One below illustrates the trend in the theoretical
performance of ATI and NVIDIA GPUs from about 2003 to 2005 along with the
performance of Pentium 4 CPUs.

Figure One: Theoretical performance of ATI and NVIDIA GPUs and Pentium 4 CPUs
(used with permission from Ian Buck, Stanford University)

Note the huge increase in performance of the GPUs relative to the CPUs. At
the end of the chart, in mid-2005, the performance of these GPUs is several
times the performance of the CPUs. For example, Intel's desktop quad core
product, the Core 2 Extreme QX6700, runs at 2.66 GHz and has a peak floating
point performance of 50 GFLOPS, while NVIDIA's latest GPU, the G80, has a
peak floating point performance of 330 GFLOPS. ATI's Radeon HD 2900 XT peak
performance is about 475 GFLOPS. Using CrossFire, two ATI Radeon cards, and a
little overclocking, the peak performance has reached approximately 1 TFLOP!
Granted, these numbers are single precision at the moment and for specialized
computations, but the potential performance is huge and the hardware is
available now.

Why is the GPU performance increasing so quickly? The PC gaming sector is
large and growing, and is currently considered a multi-billion dollar market.
The economic motivation for developing faster GPUs is obvious. In addition,
it may be easier to take extra transistors and use them for GPU computations
rather than just cache because of the nature of GPU calculations. (CPUs are
using extra transistors for more cores, not bigger cores.) Moreover, if you
haven't noticed, the release cycle for GPUs is faster than for CPUs due to
their less complex designs. So, with high potential performance at perhaps
a lower price point, you can see why GPUs are being considered as a
co-processor or even a main processor.

GPUs are Different

GPUs are fundamentally different from regular CPUs. Currently, GPUs are
designed to take a 3D model of a scene and pass it through various steps
in the rasterization-based rendering process to create an image that is
displayed. To increase the performance of this process, modern GPUs are
designed to pipeline the data. Within the rasterization process, there are a
number of stages that are performed in series, using the GPU and code
fragments to process the data.

For computations, rather than rasterization, the GPUs can be thought of as
Stream Processors. Stream processing takes a series of compute-intensive
operations (called kernels) and applies them to every element in a set of
data that is called a stream. In general, GPUs apply one kernel at a time to
a particular stream, but multiple streams can be processed at the same time
(they are independent of one another). The kernel is not an entire program,
but rather just one part of the algorithm, so to compute an entire algorithm,
several kernels are applied in series to the stream. However, since a kernel
has to be applied to the entire stream, there are things you can do on a CPU,
such as reading and writing the same data at the same time, that you can't do
on a GPU.
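
To make the stream model concrete, here is a minimal sketch (an illustrative
example, not part of the original article) written in NVIDIA's CUDA, which is
discussed later in the article. Two kernels are applied in series to a stream
of floats; each kernel performs the same operation on every element, reads
only its input buffer, and writes only its output buffer. The names and sizes
are illustrative.

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Kernel 1: square every element of the input stream.
    __global__ void square(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * in[i];      // each element is independent
    }

    // Kernel 2: scale every element of the input stream.
    __global__ void scale(const float *in, float *out, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a * in[i];
    }

    int main(void)
    {
        const int n = 1024;
        float h[n];
        for (int i = 0; i < n; i++) h[i] = (float)i;

        float *d_a, *d_b;
        cudaMalloc((void **)&d_a, n * sizeof(float));
        cudaMalloc((void **)&d_b, n * sizeof(float));
        cudaMemcpy(d_a, h, n * sizeof(float), cudaMemcpyHostToDevice);

        // Apply the kernels in series to the whole stream; each pass reads
        // one buffer and writes the other, so a kernel never overwrites
        // its own input stream.
        square<<<n / 256, 256>>>(d_a, d_b, n);
        scale<<<n / 256, 256>>>(d_b, d_a, 0.5f, n);

        cudaMemcpy(h, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("element 10 = %g\n", h[10]);

        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }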

Consequently you have to think of the data and the program in a different way
when writing for GPUs. You have to think of the algorithm in terms of
processing streams of data using a kernel that does the same operation on the
entire stream. You can chain kernels together to perform a series of
operations, but each kernel is restricted to applying the same operation to
the entire stream. The benefits of stream processing are potentially huge! In
particular:

    * There is explicit parallelism in streams (data and task)

    * No communication between stream elements (simplifies operations)

    * Kernels cannot write to the input streams

    * Stream elements are independent

    * Parallelism hides memory access (latency)

While it may not be obvious how these items contribute to performance, if
your code follows the stream model of computing, it can give you a huge boost
in performance. But you have to live within the stream computing model and
its requirements (sorry, no free lunch).

The concept of an attached processor is not new. In the days of big iron HPC,
there were specialized devices called "array processors" that were attached
to HPC computer systems. These devices, much like the GPUs of today, were
designed to do specific tasks faster than the host CPU. It seems that the
commodity graphics market has allowed an old idea to become new again.

Some of the requirements or limitations of stream computing, in addition to
the ones previously discussed, are:

    * No stack or heap

    * No integer or bit-wise operations

    * No scatter operations (a[i] = b)

    * No reduction operations (max, min, sum)

These features are missing because games haven't asked for them (yet) and
it's still early in the GPGPU (General Purpose Graphics Processing Unit)
development (in other words they may come as the demand warrants it). But, if
you can get around these limitations and follow the stream programming model,
the potential for a huge gain in performance at low cost is out there.
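
One common workaround for the missing reduction operations is to emulate a
sum as a series of gather-only passes, each combining pairs of elements into
a new stream half as long, until a single value remains. The sketch below is
an illustrative example in CUDA (again, not something prescribed by the
article), with arbitrary names and sizes.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // One reduction pass: each output element gathers (reads) two input
    // elements and writes their sum. There is no scatter and no global sum;
    // the stream simply gets half as long on every pass.
    __global__ void pairwise_sum(const float *in, float *out, int n_out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_out)
            out[i] = in[2 * i] + in[2 * i + 1];
    }

    int main(void)
    {
        const int n = 1 << 20;              // must be a power of two here
        float *h = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        float *d_in, *d_out;
        cudaMalloc((void **)&d_in,  n * sizeof(float));
        cudaMalloc((void **)&d_out, (n / 2) * sizeof(float));
        cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);

        // Repeatedly halve the stream, ping-ponging between the two buffers.
        int len = n;
        while (len > 1) {
            int n_out = len / 2;
            int threads = 256;
            int blocks = (n_out + threads - 1) / threads;
            pairwise_sum<<<blocks, threads>>>(d_in, d_out, n_out);
            float *tmp = d_in; d_in = d_out; d_out = tmp;   // swap buffers
            len = n_out;
        }

        float sum;
        cudaMemcpy(&sum, d_in, sizeof(float), cudaMemcpyDeviceToHost);
        printf("sum = %g (expected %d)\n", sum, n);

        cudaFree(d_in);
        cudaFree(d_out);
        free(h);
        return 0;
    }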

Programming GPUs

There are two major programming APIs for GPUs: OpenGL and DirectX. This means
you have to translate the algorithm into OpenGL or DirectX functions. This is
not an easy task: you have to understand both the algorithm and the code you
are writing, as well as how to program a GPU. Even using the stream
processing concepts, it is difficult to write code using the GPU APIs.

For example, instead of talking about storage (memory) you have to talk about
textures. Instead of talking about a computational kernel, you have to talk
about fragment programs. And instead of talking about a foreach loop over a
data stream, you have to talk about a render pass. There are few people who
understand both computational algorithms and GPU coding. The obvious solution
is to develop high-level languages for writing GPU code.

Higher level languages abstract away many of the fundamental GPU coding
details that are needed for expressing algorithms. This mapping makes it
easier for people who
are not graphics coders to write code that runs on GPUs. Some of the higher
level languages (or standard languages with extensions) for GPUs are:

    * Sh

    * Brook

    * Shallows

    * CUDA (Compute Unified Device Architecture)

    * RapidMind

The languages vary, but most of them look something like C or C++ and
incorporate stream concepts. For example, Brook treats the GPU as a streaming
processor, and the language looks like C with some stream extensions. At one
time a derivative of Brook called Brooktran, close to Fortran yet based on
Brook, was also being developed. Brook works with ATI and NVIDIA GPUs, OpenGL
and DirectX, as well as Windows and Linux.

CUDA is very interesting because NVIDIA created a new model for programming
general purpose computations on GPUs. It is data parallel computing using
thousands of threads with a Parallel Data Cache to help increase arithmetic
intensity for large performance boosts (arithmetic intensity refers to the
compute intensity of the code). The nice thing is that you can program in C
and then use extensions to program for the GPUs. This feature allows you to
target certain portions of the code for execution on the GPU and the rest to
run on the CPU. CUDA has four sets of extensions:

    * Function type qualifiers (define whether a function runs on, and can be
called from, the CPU or the GPU)

    * Variable type qualifiers (specify the memory location of a variable on
the GPU)

    * A directive to specify how a kernel is executed on the GPU from the CPU

    * Four built-in variables that specify the grid and block dimensions and
the block and thread indices (this describes the problem so the run-time can
put the threads in the appropriate location)

As you can see, the extensions are fairly minor. You then take your code and
build it with NVCC (NVIDIA's compiler that comes with CUDA), which splits the
code into two parts: the GPU-specific code, and the CPU-specific code. The
CPU-specific code is output so you can compile it outside of NVCC. The GPU
code is compiled into a GPU binary form.
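
To show where these four kinds of extensions appear in practice, here is a
small annotated sketch (an illustrative example, not from the article); it
can be built with NVCC as described above, and the kernel and variable names
are arbitrary.

    #include <stdio.h>
    #include <cuda_runtime.h>

    // (1) Function type qualifiers: __device__ marks a function that runs on
    // the GPU and is called from GPU code; __global__ marks a kernel that
    // runs on the GPU but is launched from the CPU.
    __device__ float saxpy_element(float a, float x, float y)
    {
        return a * x + y;
    }

    __global__ void saxpy(float a, const float *x, float *y, int n)
    {
        // (2) Variable type qualifiers: __shared__ places a variable in the
        // on-chip memory shared by the threads of one block.
        __shared__ float a_local;
        if (threadIdx.x == 0)
            a_local = a;
        __syncthreads();

        // (4) Built-in variables (gridDim, blockDim, blockIdx, threadIdx)
        // tell each thread where it sits in the grid of blocks.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = saxpy_element(a_local, x[i], y[i]);
    }

    int main(void)
    {
        const int n = 4096;
        float hx[n], hy[n];
        for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

        float *dx, *dy;
        cudaMalloc((void **)&dx, n * sizeof(float));
        cudaMalloc((void **)&dy, n * sizeof(float));
        cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

        // (3) The execution configuration directive <<<blocks, threads>>>
        // tells the run-time how to lay out the threads when launching the
        // kernel from the CPU.
        saxpy<<<n / 256, 256>>>(3.0f, dx, dy, n);

        cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("y[0] = %g\n", hy[0]);   // expect 3*1 + 2 = 5

        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }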

CUDA has a run-time library that runs on the CPU (the host) and provides
functions to access GPUs (possibly more than one) and to control them. It
also has a component that runs on the GPUs and provides specific GPU
functions, and a "common component" that has built-in vector data types
(remember to think of the GPUs as stream processors) and a subset of the
Standard C Library that runs on both the CPU and the GPUs. It also comes with
pre-built BLAS libraries (Basic Linear Algebra Subprograms) and FFT libraries
(Fast Fourier Transforms). One nice thing about CUDA is that it allows for
combined CPU and GPU programming. It is currently specific to NVIDIA GPUs and
it is freely available.
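
As a sketch of what calling one of the bundled libraries looks like, the
fragment below uses the FFT library's C interface (cuFFT, assuming the
interface that ships with the CUDA toolkit) to run a single precision,
complex-to-complex transform on the GPU; it is linked with -lcufft. The data
and sizes are illustrative, and this is an added example rather than one from
the article.

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main(void)
    {
        const int N = 256;

        // Build a simple complex test signal on the host.
        cufftComplex h_sig[N];
        for (int i = 0; i < N; i++) { h_sig[i].x = (float)i; h_sig[i].y = 0.0f; }

        // Copy the signal to the GPU.
        cufftComplex *d_sig;
        cudaMalloc((void **)&d_sig, N * sizeof(cufftComplex));
        cudaMemcpy(d_sig, h_sig, N * sizeof(cufftComplex),
                   cudaMemcpyHostToDevice);

        // Plan and execute a single precision, in-place, complex-to-complex
        // forward FFT using the library that comes with CUDA.
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);
        cufftExecC2C(plan, d_sig, d_sig, CUFFT_FORWARD);

        cudaMemcpy(h_sig, d_sig, N * sizeof(cufftComplex),
                   cudaMemcpyDeviceToHost);
        printf("bin 0: %g + %gi\n", h_sig[0].x, h_sig[0].y);

        cufftDestroy(plan);
        cudaFree(d_sig);
        return 0;
    }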

GPU Computing Resources

There are some GPU resources available around the Internet and fortunately
there seems to be one central GPU website that links to papers, tools,
workshops, etc. It is a great place to keep up with new developments on GPUs,
search for materials, or learn more about the many ways GPUs are being used.
It also includes a Wiki that has plenty of great information on getting
started with GPUs. More importantly, there are a number of tutorials linked
from the website that provide an introduction to GPUs. Alongside the
tutorials there is also a link to developer resources which provides basic
developer tutorials using lower level toolkits such as Cg. There are also
some sample codes and a couple of utility codes. If you have questions,
either about the tutorials or in general, the website also has a forums
section where you can post them.

There are other resources as well, but some of them are specific to a
particular graphics vendor. However, the materials are usually freely
available. NVIDIA has a very extensive developer section that supports
programming its GPUs for either graphics (games) or general purpose
computing. The primary discussion about GPU computing is on the CUDA website,
however. It has
links to downloads, documentation, samples from the SDK, and a forum for
CUDA.

ATI also has a website that has information on programming its GPUs (from a
graphics perspective). The company also has a page devoted to Stream
Computing that serves as a launch point for ATI's efforts into Stream Computing
in general. ATI also has a site to describe its current stream computing
processor called the AMD Stream Processor. You can also search the ATI
website for CTM (Close To Metal), which is a thin interface layer above the
GPU hardware. While it's not a compiler, it allows other developers to either
interface their existing tools to ATI's GPUs or to develop new tools.

There is a relatively new company, called RapidMind, that's developing
programming technology that can be used in a normal C++ environment
(compilers and IDEs). The RapidMind Platform is embedded in the application
and transparently manages massively parallel computations. The general
process to convert your code to use RapidMind is to identify the parts of the
code you want to run on the GPU (RapidMind also supports the Cell Processor)
and then perform the following steps:

   1. Convert the data types to use RapidMind's data types (integer and
floating point)

   2. Capture the computations (RapidMind captures numerical operations when
the code is run; these are recorded and dynamically compiled into a program
object)

   3. Stream Execution (the RapidMind run-time manages the execution of
objects on the GPU)

RapidMind has some case studies that show how it works and what kind of
results you can expect. For example, there is a paper that compares the
performance of the BLAS SGEMM routine (single precision matrix multiply), a
single precision FFT, and a European option pricing model (Black-Scholes) on
a GPU versus the performance on a CPU with highly-optimized code. RapidMind
compared its GPU routines, running on both an ATI X1900 XT card and an NVIDIA
7900 GTX card, to the optimized CPU routines running on an HP workstation
with Intel Woodcrest chips and on an AMD Opteron at 2.6 GHz. The researchers
found that the GPU code for SGEMM was 2.4 times faster than the CPU code; the
GPU FFT code was 2.7 times faster; and the Black-Scholes GPU code was 32.2
times faster than the CPU code.

Success Stories

There have been several success stories about using GPUs in computation. The
first is a project called Folding@Home. The Folding@Home application is
downloaded and run on the user's desktop as a voluntary effort to help
researchers who have protein folding problems to solve; it runs a protein
folding code to support their research. On Sept. 26, 2006, ATI announced that
it had ported the Folding@Home application to use ATI GPUs.
ATI estimated that an ATI X1900 graphics card could get 20X to 40X more
performance than a standard CPU. The company also estimated that using ATI
graphics cards it would take only one month to finish what previously took
three years to solve.

Another success story comes from Massachusetts General Hospital. At the
hospital, they take low-power X-rays that can be combined to form a real-time
image
through a process called Digital Tomosynthesis. In the past the hospital had
been using a 35-node cluster to do the computations. Using NVIDIA G80 GPUs
they saw a performance improvement of about 100 times compared to standard
CPUs. This increase means the same task can be performed on a workstation,
moving the computational part of the process to the imaging unit.

These are just two success stories about GPUs. There are others. For example,
some researchers have come up with several LU (matrix factorization) routines
that run much faster on GPUs than CPUs. The LU routine is used by many CAE
(Computer Aided Engineering) codes to solve problems.

GPUs for Clusters

So far I've talked about efforts to run code on GPUs on a single cluster
node. GPU codes can be used in conjunction with the CPU, if the CPU is
running a parallel MPI job. Expect more in this area in the future. In the
meantime, GPUs have been used in clusters in other ways.

GPU clusters, or clusters where the GPU is the primary focus, are currently
being used as tiled displays. An example of this is the package called
Chromium, which performs parallel rendering. You can use Chromium to drive
multiple displays or to drive a single display using multiple compute nodes
equipped with GPUs.

Mike Houston at Stanford has been working with GPU clusters for some time. He
worked with a team on a parallel GPU version of a protein search code that
uses HMM (Hidden Markov Models). The team called its version ClawHMMer. The
HMM code was rewritten to run on GPUs and then modified again so that the
database searches are divided among multiple nodes in a GPU cluster. In
general it's an embarrassingly parallel workload, because the individual
searches are independent of one another, even though a single search itself
isn't parallel. The per-node performance is quite good (about 10-40 times the
performance of a single CPU) but, more importantly, the parallel code scales
very well (over 90 percent scalability at 16 nodes).

The Visualization Lab at SUNY Stony Brook has been working with GPU clusters
for several years. The lab does research in both distributed graphics as well
as distributed computation using the GPUs. In 2004, the Visualization Lab
presented a paper about developing a Lattice Boltzmann Method (LBM) on a GPU
cluster to simulate the dispersion of airborne contaminants in the Times
Square area of New York City. The team added graphics cards (GPUs) to nodes
in a cluster and rewrote the LBM code to use GPUs as well as the network.
Since the GPUs don't have direct access to the network interfaces, any data
to be exchanged had to be copied from the GPU to the CPU, sent over the
network to the required node, and then copied to the GPU on that node. Using
this method, the programmers were able to get a speedup of 4.6X compared to
just using the CPUs.
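
A minimal sketch of that staging pattern (an illustrative example, not the
Stony Brook code) might look like the following; it assumes CUDA plus an MPI
library and is built with nvcc using the MPI compiler wrapper's include and
library flags. The function name, buffer sizes, and rank pairing are all
illustrative assumptions.

    #include <stdlib.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    // Exchange one boundary buffer with a neighboring node: GPU -> CPU,
    // CPU -> CPU over the network, then CPU -> GPU on the receiving side.
    void exchange_boundary(float *d_send, float *d_recv, int count,
                           int neighbor, MPI_Comm comm)
    {
        float *h_send = (float *)malloc(count * sizeof(float));
        float *h_recv = (float *)malloc(count * sizeof(float));

        // 1. GPU -> CPU: pull the boundary data off the card.
        cudaMemcpy(h_send, d_send, count * sizeof(float),
                   cudaMemcpyDeviceToHost);

        // 2. CPU -> CPU: exchange with the neighboring node via MPI.
        MPI_Sendrecv(h_send, count, MPI_FLOAT, neighbor, 0,
                     h_recv, count, MPI_FLOAT, neighbor, 0,
                     comm, MPI_STATUS_IGNORE);

        // 3. CPU -> GPU: push the received boundary data back onto the card.
        cudaMemcpy(d_recv, h_recv, count * sizeof(float),
                   cudaMemcpyHostToDevice);

        free(h_send);
        free(h_recv);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int count = 1024;
        float *d_send, *d_recv;
        cudaMalloc((void **)&d_send, count * sizeof(float));
        cudaMalloc((void **)&d_recv, count * sizeof(float));
        cudaMemset(d_send, 0, count * sizeof(float));

        // Pair up ranks: even ranks exchange with the next odd rank.
        int neighbor = (rank % 2 == 0) ? rank + 1 : rank - 1;
        if (neighbor < size)
            exchange_boundary(d_send, d_recv, count, neighbor, MPI_COMM_WORLD);

        cudaFree(d_send);
        cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }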

Based on their experience with the LBM code as well as others, the
Visualization Lab has recently developed an object-oriented middleware for
clustered GPU programming called ZippyGPU. It sits above the low-level GPU
APIs such as OpenGL and Cg, as well as the communication API, MPI. It
encapsulates and manages the objects and functions provided by these APIs,
making coding much easier. Your code makes use of ZippyGPU's objects to
provide access to the GPU and to the network. ZippyGPU is designed to support
visualization, image compositing, as well as GPU-based computations. The lab
has shown how to use ZippyGPU to write something like the GA (Global Arrays)
toolkit, combining the memory of the GPUs within an MPI run into large data
arrays and computing with them (or using them for visualization).

Acceleware is a company that is developing GPU hardware and software. They
market Accelerator Boards that are plugged into PCI-e slots to provide GPU
(Stream) processing for single machines, primarily workstations. They also
market workstations that use two or even four Accelerator Boards in a single
machine. As an example, the company has written a conjugate gradient solver
for sparse systems of equations. On a 32-bit problem, the code achieved a 3X
performance increase over current CPUs. They are also developing
electromagnetic problem solvers that use GPUs.

Finale

GPUs offer great hope in increasing the performance of HPC codes. While
currently difficult to program and limited to "the right problem", the
performance gains are too large to ignore. Better tools and help are on the
way. And for all the gamers out there, next time you are considering a new
graphics card, think FLOPS instead of frags.

Dr. Jeff Layton has all of his degrees in Aeronautical and Astronautical
Engineering with a focus on topics that require HPC. He has been working with
HPC for over 20 years and has been working with clusters for over 10 years.
He has written on a variety of topics including clusters, MPI, file systems,
and performance tuning. He can be reached at laytonjb at gmail.com.
