CAs on Beowulfs (was Re: Linux vendors)


From: Eugene Leitl (eugene.leitl@lrz.uni-muenchen.de)
Date: Fri Jan 28 2000 - 16:08:07 PST


Kragen Sitaker writes:
> Can we take this to the Beowulf list?
 
I don't know, for now it might be a bit off-topic there (of course
nothing is ever too off-topic for FoRK ;). We should specify the
project in more detail first, if we want to move it to the 'wulf
list. I've long wanted to put my (metaphorical) money where my mouth
is, and a joint project like this (an OpenSource high-performance CA
package specifically optimized for Beowulfs) sounds like a very
worthwhile thing to do. I think we should be able to find at least 2-3
other contributors, e.g. I know Whygee here

   http://www.mime.univ-paris8.fr/~whygee/

has used Pentium MMX assembly to accelerate lattice gas codes, with
very good results. There should be others, and they shouldn't be too
hard to find.
 
> Eugene Leitl writes:
> > [ 2d vs 3d volume/surface ]
> It may not be very interesting to you, but it's *very* interesting to
> me. :)
 
I presume you're aware of http://xxx.lanl.gov/

Have a gander at
  http://arXiv.org/find/nlin/1/abs:+AND+lattice+gas/0/1/0/all/0/1
specifically at
  http://xxx.lanl.gov/ps/cond-mat/9907298

(be sure to look at the figures). 3D is where the action is! (And if
the connections to real-world physics are tenuous, at least the
visuals are stunning ;)

> I am thinking about hooking all the nodes up to a full-duplex switch,
> and sending the whole simulation plane through a pipeline of all the
> nodes, once. Recirculating might make load-balancing easier, but it
> might also drive communication costs through the roof.

Whatever you say. Obviously I must rethink the problem; in
particular, your suggestion of exchanging more than the direct
neighbour faces needs further thought. It might be really useful for
my problem... just doubling the face buffer depth might do.
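
Here's a minimal sketch, in C, of what I mean (sizes, names and the
placeholder rule are all illustrative, not from any existing code):
with a ghost (halo) layer of depth 2 instead of 1, each node can run
two CA iterations locally before it has to exchange faces again,
because each local step only invalidates one more layer of the halo.

#define HALO 2                 /* halo depth: face layers exchanged at once */
#define NC   64                /* cells per axis owned by this node         */
#define NT   (NC + 2 * HALO)   /* allocation including the halo             */

static unsigned char buf[2][NT][NT][NT];

/* Placeholder rule: a real one would read the neighbour sites. */
static unsigned char rule(unsigned char (*s)[NT][NT], int x, int y, int z)
{
    return s[z][y][x];
}

void iterate_between_exchanges(void)
{
    int step, lo, hi, x, y, z, cur = 0;

    for (step = 0; step < HALO; step++) {
        /* each local step makes one more halo layer stale, so the
           region we may still update shrinks by one cell per side */
        lo = 1 + step;
        hi = NT - 1 - step;
        for (z = lo; z < hi; z++)
            for (y = lo; y < hi; y++)
                for (x = lo; x < hi; x++)
                    buf[1 - cur][z][y][x] = rule(buf[cur], x, y, z);
        cur = 1 - cur;
    }
    /* only now exchange all HALO layers of each face with the six
       neighbours; the owned NC^3 core stays valid for HALO steps   */
}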
 
> Each stage board has 16 stages; it's possible to stick quite a few of
> them in a backplane and end up with hundreds of stages. At a hundred
> stages, you'd be doing a billion site updates per second, for a few
> hundred thousand dollars --- both faster and cheaper, for this task,
> than the CM-5 Connection Machine from Thinking Machines at LANL.
 
Have you checked out http://www.im.lcs.mit.edu/cam8/ ?

More distantly related, but also distinctly impressive:

     http://www.hip.atr.co.jp/~degaris/cam/index.html
     http://www.hip.atr.co.jp/~degaris/cbm/index.html
 
> ERIM did some very impressive things with the Cytocomputers at the time
> and for years afterwards, including automatically tracking white blood
> cells under a microscope, controlling robots to pick mail out of mail
> bins and parts out of parts bins, controlling railroad-spiking robots,
> analyzing photomicrographs of bumper surface defects to figure out how
> well paint would stick, and most recently, in 1997, automatically
> recognizing road lanes from video images for a prototype
> collision-avoidance system.
>
> The technology was undermarketed, I think; we had one machine in the

Ditto with the CAM8: it never sold well. Lunatic fringe iron.

> field at Eli Lilly counting bacteria at one point, but that was the
> only real field application this stuff ever saw. There were things we
> did in the lab ten years ago that people still can't do today.
 
There is hardly any dispute that cellular automata are highly
useful, and not only on dedicated hardware. All the way back to von
Neumann, CAs have been chronically underestimated and underutilized.
 
> > This is an interesting thought. This assumes that information transfer
> > (exchange of vicinal tetragonal prism faces) is the bottleneck, and
> > that the data exchange has no impact upon the CPU. I'm thinking about
> > 4 NICs/node, which would saturate the CPU (and certainly the PCI bus)
> > fully if operated in full-duplex. You can't do much during the
> > transfer, and each volume takes essentially constant time to process.
>
> This puzzles me a bit. I guess you're talking about alternating
> communication and computation phases, which puts bigger peak demands on
> your communication hardware. Wouldn't it be cheaper to use fewer NICs
> per node and communicate at a more constant rate? 4 NICs means you'll

I guess I need to run a benchmark. Maybe by exchanging twice the
direct neighbourhood depth (as in the halo sketch above) I could
decouple computation and communication. Of course I'd need to double
the transfer volume, which will also drain the CPU, so the speedup
might be smaller than expected. Hard to predict a priori.
Benchmark...

> have to use either expensive four-port cards or expensive two-PCI-bus
> motherboards.

Oh no, there are plenty of 5-PCI-slot motherboards (and even 6-slot
ones) out there, and you can get Netgears for less than $15 apiece.
Also, the multiple-NICs-per-node thing is actually more of a
scalability issue: you can have an infinitely scalable 3d torus
topology (very much like the Cray T3D) if you use 8-port FastEthernet
switches (cheap, unmanaged yet adaptive, good backplane bandwidth
since single-ASIC) as glue. Something like this (single elementary
cell shown from above; N: node, S: switch):

  | |
--N N--
   \ /
    S
   / \
--N N--
  | |

The elementary cell has one 8-port switch and 8 nodes with 4
NICs/node. The switch ports never connect directly to other switch
ports, and the loops all lead through nodes, hence no problem.
Moreover, I don't think I'll run TCP/IP through the whole thing, but
rather use some userland library like M-VIA or GAMMA.
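
For concreteness, the wrap-around addressing could look like the C
below; the coordinate scheme and node ids are my own assumptions,
nothing mandated by M-VIA or GAMMA.

/* Six face neighbours of the node at (x,y,z) on a 3d torus of
   nx*ny*nz nodes; the modulo wrap is what closes the torus and
   lets the lattice scale without edge cases. */
static int node_id(int x, int y, int z, int nx, int ny, int nz)
{
    x = (x % nx + nx) % nx;   /* wrap into [0, nx) */
    y = (y % ny + ny) % ny;
    z = (z % nz + nz) % nz;
    return (z * ny + y) * nx + x;
}

/* nb[0..5] = -x, +x, -y, +y, -z, +z neighbours */
static void torus_neighbours(int x, int y, int z,
                             int nx, int ny, int nz, int nb[6])
{
    nb[0] = node_id(x - 1, y, z, nx, ny, nz);
    nb[1] = node_id(x + 1, y, z, nx, ny, nz);
    nb[2] = node_id(x, y - 1, z, nx, ny, nz);
    nb[3] = node_id(x, y + 1, z, nx, ny, nz);
    nb[4] = node_id(x, y, z - 1, nx, ny, nz);
    nb[5] = node_id(x, y, z + 1, nx, ny, nz);
}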

> If your communication time is insignificant compared to your
> computation time, why bother with all the iron? Conversely, if your

Why all the iron? We're talking $85 of extra cost per node (switch:
$200/8 = $25, plus 4 NICs at $15 = $60). Compare this to a ~800 MHz
Athlon with 0.5..1 GByte of memory, plus motherboard, hard drive and
case. For this you get a great communication cross-section (you can
saturate the PCI bus, at the cost of 60-80% of CPU performance during
communication bursts), and you have an infinitely scalable 3d lattice
of nodes with just the right symmetry (cubic primitive lattice) for
CAs -- remember that the data communication is always purely local
during one iteration.

> communication time is large, wouldn't it be cheaper to write the code
> to overlap?
 
As I said, I'll look into overlapping computation/communication
cycles. This is really a clever idea. Btw, are you aware of HashLife?
If you're lucky (strongly nonstochastic pattern distribution) you
might achieve spectacular speedups with just a bit of recursive
lightcone hashing. Might be well worth the extra hassle.
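
In case you haven't seen the guts of HashLife: the key trick is
hash-consing the quadtree (octree, in 3d) of macrocells, so identical
regions of the pattern are stored once and their futures computed
once. A toy C sketch of just that canonicalization step (everything
here is illustrative; the memoized lightcone "result" computation is
elided):

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    struct Node *nw, *ne, *sw, *se; /* quadrant children (NULL at leaves) */
    struct Node *result;            /* cached centre of the lightcone,
                                       one level down (not computed here) */
    struct Node *next;              /* hash bucket chain */
} Node;

#define TABSIZE (1 << 20)
static Node *table[TABSIZE];

static size_t hash4(Node *a, Node *b, Node *c, Node *d)
{
    size_t h = (size_t)a;
    h = h * 1000003u ^ (size_t)b;
    h = h * 1000003u ^ (size_t)c;
    h = h * 1000003u ^ (size_t)d;
    return h % TABSIZE;
}

/* Return the unique canonical node with these four children; two
   identical regions of the pattern always become the same pointer. */
static Node *find_node(Node *nw, Node *ne, Node *sw, Node *se)
{
    size_t h = hash4(nw, ne, sw, se);
    Node *n;
    for (n = table[h]; n; n = n->next)
        if (n->nw == nw && n->ne == ne && n->sw == sw && n->se == se)
            return n;                    /* seen before: share it */
    n = calloc(1, sizeof *n);
    n->nw = nw; n->ne = ne; n->sw = sw; n->se = se;
    n->next = table[h];
    table[h] = n;
    return n;
}

int main(void)
{
    Node *a = find_node(NULL, NULL, NULL, NULL);
    Node *b = find_node(NULL, NULL, NULL, NULL);
    printf("shared: %s\n", a == b ? "yes" : "no"); /* prints "yes" */
    return 0;
}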

> Can you really spit data out at 400 megabits per second?
 
We're talking about 4 NICs running full-duplex, which is actually
800 MBit/s of aggregate bandwidth. That's almost GBit Ethernet grade.

> Obviously whether communication is a bottleneck depends on your
> volume/surface ratio and the complexity of your rule.
 
Yes, and I don't know how complex the rule is really going to be. An
argon box (8 bits/degree of freedom) is certainly cheaper than water,
especially water with lots of electrostatics and polymers.

> > How complex is your rule, in machine instructions? How many bits has
> > your cell?
>
> Eight bits per cell; the rules vary in complexity from about two or

8 bits/cell, that's perfect for MMX.

> three machine instructions to much, much more. My prototype
> implementation of one common rule has an inner loop of 79 instructions,
> most of which only get executed if the old state is a particular
> 'empty' state. I expect I can probably cut that by half at least.
>
> Said prototype achieves between 1.7 million site updates per second
> (with little help from cache and most cells 'empty') to 7 million site

Alas, I cannot use such tricks. Even vacuum (I should have ~50% or
more vacuum voxels) has properties, if you look at it from an
electrostatics point of view.

> Sounds like you're right --- and your problem will be ideal for a
> Beowulf. (Although maybe it would be even more ideal for Margolus's
> proposed machine: http://www.im.lcs.mit.edu/nhm/isca.pdf. He's
> projecting 1Tbit/sec of updates per chip, with 80 megabits of storage
> per chip.)

Ha! Embedded RAM hardware CAs! Fantastico.
 
> This is true when your cellular array is large compared to your number
> of processors. When your cellular array is fixed-size and your number
> of processors increases, your ratio of communication to computation
> increases, decreasing your efficiency.
 
That's true, but if you want to put a cubic micron of biology into
your box at full molecular detail, your cellular array is not
fixed-size, but as large as you can cram into the iron you have.
 
> If my cellular array is 512x512, it will be very difficult to get my
> code to scale linearly to 10^6 nodes no matter how I divide it. If I
> have to run it through 100 generations, I may be much better off with a
> temporal division among 100 nodes --- or, for that matter, a spatial
> division among 100 nodes --- than with any kind of division among 1000
> or 10 000 nodes.

True, but not if you think in terms of a fixed 256^3 voxels on each
node. Then the only way to speed it up is upgrading to faster
hardware/networking.
 
> Well, like I said, I'm not memory-bound now. My old 66-MHz-bus
> PPro-200 can run memset() to main memory at 80 megabytes per second,
> which is 4 times the rate I need to get 10 million cell updates per
> second.
 
I bet you can fix that with MMX, even making it memory-bound. Think
Athlon ~700, with MMX.
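
To illustrate why 8 bits/cell and MMX go together so well: one 64-bit
MMX register holds 8 cells, so packed-byte instructions touch 8 sites
at a time while streaming through memory. A minimal C sketch using
the MMX intrinsics (the "rule" here is a made-up saturating add of
two rows of cells, purely to show the data layout):

#include <mmintrin.h>

/* n must be a multiple of 8; dst/a/b are rows of 8-bit cells */
void add_rows_mmx(unsigned char *dst, const unsigned char *a,
                  const unsigned char *b, int n)
{
    int i;
    for (i = 0; i < n; i += 8) {
        __m64 va = *(const __m64 *)(a + i);
        __m64 vb = *(const __m64 *)(b + i);
        /* PADDUSB: 8 saturating unsigned byte adds per instruction */
        *(__m64 *)(dst + i) = _mm_adds_pu8(va, vb);
    }
    _mm_empty(); /* clear MMX state so FPU code can follow safely */
}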
 
> > Perhaps I should go for a splatter, or similar.
>
> What's that?
 
A splatter is a voxel renderer which streams through the voxel set
in memory order, hence profiting from burst transfers. Here's one
implementation:

http://mpire.sdsc.edu/Splatter/Splatter_papers.html
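
The core memory-order idea in a nutshell, as a toy C sketch: an
axis-aligned "X-ray" projection. A real splatter adds a
reconstruction footprint and an arbitrary view transform; all names
here are mine, not from the papers above.

#include <stddef.h>

/* Walk the volume exactly in storage order (x fastest), so every
   read is sequential and profits from cache-line bursts, and
   accumulate each voxel into a z-axis projection. */
void splat_xray(const unsigned char *vol, int nx, int ny, int nz,
                unsigned int *image /* nx*ny accumulators, zeroed */)
{
    size_t i = 0;
    int x, y, z;

    for (z = 0; z < nz; z++)
        for (y = 0; y < ny; y++)
            for (x = 0; x < nx; x++)
                image[(size_t)y * nx + x] += vol[i++];
}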
 
Shear-warp is a better splatter; unfortunately it needs shared
memory, and even then it does not scale linearly. Check out Philippe
Lacroute's thesis for a nice review:

       http://www-graphics.stanford.edu/papers/lacroute_thesis/

> Memory bandwidth issues in my case will become relevant only when each
> output cell state takes three cycles or less to produce, on average.
> If I could use SIMD instructions, maybe I could take 12 cycles to
> produce four output cells instead.
 
Hey, we've got ~1 GHz machines now, and VLIW stuff (256 bit and
beyond) is in the pipeline. Playstation2 is a 1 kBit architecture,
gfx-wise.
 
> I don't think it's going to happen. I'd be happy with memory half as
> fast as what I have now.

For the time being, you're right. For the time being.
 
> > Not only that, recently people are going beyond CFD and have started
> > doing Molecular Dynamics type of stuff.
>
> I don't understand.

You certainly know about the lattice gas approach to computational
fluid dynamics. Well, there is a company making a living selling a
dedicated 8 bit/degree-of-freedom integer lattice gas code for
computational fluid dynamics, which is exact:

      http://www.exa.com/newsite/PowerFLOW/exapapersdirectory.html

Others have used this approach for modelling amphiphilic fluids,
liposomes, micelles and such. It is obvious that you can probably
squeeze a decent Molecular Dynamics code into 8 bits/degree of
freedom: 8-bit atom type, 3 bytes for position, 3 for velocity.
Double the last two for old and new generations, pull a crazy rule
out of the rulespace, and voila, Molecular Dynamics on cellular
automata. Add dedicated hardware, mmh.
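
Back-of-the-envelope, the per-site state could look like this in C
(purely an illustrative layout, 13 bytes per occupied site; nothing
here comes from the PowerFLOW code or anyone else):

typedef struct {
    unsigned char type;        /* 0 = vacuum, else atom species id        */
    unsigned char pos_old[3];  /* fixed-point position offset within cell */
    unsigned char vel_old[3];  /* fixed-point velocity components         */
    unsigned char pos_new[3];  /* position/velocity doubled for the old   */
    unsigned char vel_new[3];  /* and new generations, as described above */
} MdSite;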


