Re: Handel-C (compile to wire, not code)


From: Eugene Leitl (eugene.leitl@lrz.uni-muenchen.de)
Date: Sun Oct 08 2000 - 13:33:24 PDT


Tony Finch writes:

> Hmm. I'm not convinced. A lot of the problems with defects that you
> mentioned are already being addressed by DRAM manufacturers including
> spare capacity on the die along with a mechanism akin to bad block
> remapping on hard disks. Like RAM, the structure of FPGAs is also very

I would do it in software. Periodically run memory test sweeps in the
background, and mark bad words as permanently allocated by a
memory.bad() object.
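
A minimal sketch of such a sweep, in Python for readability. Everything
here is an assumption made for illustration: the FaultyMemory class
simulates words with bits stuck at 0, and sweep() and alloc() are
hypothetical names standing in for whatever memory.bad() would do in a
real allocator. A real sweep would also have to lock or relocate live
words while it tests them.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1
# Test patterns: all zeros, all ones, and the two alternating-bit words.
PATTERNS = (0x00000000, 0xFFFFFFFF, 0xAAAAAAAA, 0x55555555)


class FaultyMemory:
    """Simulated RAM in which some words have bits stuck at 0 (assumed fault model)."""

    def __init__(self, size, stuck_at_zero=None):
        self.cells = [0] * size
        self.stuck = dict(stuck_at_zero or {})  # address -> mask of stuck bits

    def write(self, addr, value):
        # Stuck bits silently drop to 0 on every write.
        self.cells[addr] = value & MASK & ~self.stuck.get(addr, 0)

    def read(self, addr):
        return self.cells[addr]


def sweep(mem, size, bad):
    """Write and read back each pattern; record failing addresses in `bad`."""
    for addr in range(size):
        if addr in bad:
            continue  # already retired, do not touch it again
        saved = mem.read(addr)
        for pattern in PATTERNS:
            mem.write(addr, pattern)
            if mem.read(addr) != pattern:
                bad.add(addr)
                break
        else:
            mem.write(addr, saved)  # word is healthy, restore its contents
    return bad


def alloc(size, in_use, bad):
    """Hand out the first word that is neither in use nor marked bad."""
    for addr in range(size):
        if addr not in in_use and addr not in bad:
            in_use.add(addr)
            return addr
    raise MemoryError("no free words left")
```

Treating the bad set as permanently allocated means the allocator never
needs a separate code path for defects; a bad word simply never becomes
free.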

> regular so they could use the same technique.
 
Sure, but FPGAs don't scale. You're faking total connectivity on an
immutable substrate. Doing it with a single crossbar is totally
inefficient, and would broaden the delay spread as well as increase
average latency, so you make little crossbars in the cells, and then
crossbars between the cells. The connectivity is faked because you
don't have a straight and honest piece of copper between adjacent
pieces of logic, so signal delays go up, and the machinery for virtual
connectivity bloats the die, reducing functionality concentration and
increasing the average intergate distances.

> >* as switching speed goes up, relativistic latency will start
> > constraining on-die accesses to immediate vicinity
>
> Pipelining handles this problem at the moment, albeit not in a
> comprehensive way.
 
The intrinsic assumption is that you have code locality. As soon as
you have code written by a statistical process, this is no longer
true, and you might as well forget the pipeline, because you will wind
up flushing it quite often.

Clearly off-die accesses are a bottleneck, because you have a limited
number of pins (expensive with respect to packaging), not to mention
bonding pads and drivers. It takes forever for a tiny DRAM cell to
bring up a macroscopic piece of copper on the motherboard to high,
because it can't pump charge worth shit even while talking to the
on-die pin driver. So you've got a double bottleneck: the number of
pins limits the word size, and the drivers add to the total wattage
burned and impose a stiff latency penalty. By now the amount of
resources on the CPU die allocated to dealing with the problem is
considerable. If you chuck out the BPU, pipelining, caches and MMU,
you've got space for several MBytes of memory on-die, which can be
accessed in ~kBit words in a temporally flat manner, with a latency
between that of a register and a cache access. Sounds good, eh?

The sane solution to this would choose the die size so as to make the
yield quantitative and the die cheap, and allocate the resources
between storage, CPU and packet-switched serial networking. You'll
wind up with RAMs with built-in CPUs which only need juice and a few
fast ultralocal serial packet-switched buses (since we only need to
talk over cm distances, you could get >>10 GBps via one such pipe, and
with the right hardware protocol message passing latency will be <<100
ns).
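
A back-of-the-envelope check of those two numbers, as a sketch. The
assumptions are all mine: "GBps" is read as gigabytes per second, the
packet is 32 bytes, the link is 5 cm long, and the signal travels at
about half the speed of light. Switch and protocol overhead are left
out, so this is a lower bound and not a prediction.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def message_latency(packet_bytes, bytes_per_second, distance_m,
                    velocity_factor=0.5):
    """Serialization time plus time of flight, in seconds.

    velocity_factor is the assumed signal speed as a fraction of C.
    """
    serialization = packet_bytes / bytes_per_second
    flight = distance_m / (velocity_factor * C)
    return serialization + flight


# Hypothetical example: 32-byte packet, 10 GB/s link, 5 cm of wire.
latency = message_latency(32, 10e9, 0.05)
print(f"{latency * 1e9:.2f} ns")
```

Under these assumptions serialization (3.2 ns) dominates the time of
flight (about 0.33 ns), and the sum stays well below 100 ns.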

I do not see why you would need more than 0.5 MTransistors for a CPU
with a ~1 kBit wide ALU plus switch. The rest of the die could be used
for DRAM cells. If you make the dies small enough, and the networking
redundant and self-healing, you can simply forgo cutting up the wafer
into dies. Depending on the die size, the process and the phase of the
moon, 30-70% of the wafer would contain dead dies (and that number
will go up during the effective lifetime due to failures), which you
could test in software during production (use a worming test program:
boot a random die from the link, and it worms through the wafer and
reports the number of bad dies and their locations. You then cut off
the bad dies from the power grid with a laser, to reduce the total
power dissipation).
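
The worming test can be sketched as a flood fill over the die grid.
This is only an illustration under assumed conditions: the wafer is a
rectangular grid, each die talks to its four neighbours, a dead die
neither answers nor forwards, and worm_test() is a hypothetical name.

```python
from collections import deque


def worm_test(alive, start):
    """Spread from `start` over working dies; report the ones never reached.

    alive -- 2D list of bools, True where the die works (simulated wafer)
    start -- (row, col) of the die booted from the external link
    Returns (reached, suspected_bad), the latter as a sorted list.
    """
    rows, cols = len(alive), len(alive[0])
    reached = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in reached and alive[nr][nc]):
                reached.add((nr, nc))
                queue.append((nr, nc))
    every_die = {(r, c) for r in range(rows) for c in range(cols)}
    return reached, sorted(every_die - reached)


# Hypothetical 3x3 wafer with two dead dies.
wafer = [
    [True, True, False],
    [True, False, True],
    [True, True, True],
]
reached, bad = worm_test(wafer, (0, 0))
print(bad)
```

One limitation of this sketch: a working die that is completely
surrounded by dead ones is never reached, so it is reported as bad too.
With redundant links that case should be rare, and such a die is
useless to the rest of the wafer anyway.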

The reasons why this is not being done: 1) embedded RAM processes are
very new, and SRAM cells take up 4-6 times the space of a DRAM cell,
so that makes for very small memory grains. 2) the resulting
architecture (100s of CPUs, each with a memory grain of a few MBytes,
and hardware message passing) would weird out 99% of software
developers, and we can't yet import H1Bs from Mars or the galactic
halo.

> >* active signal propagation knows no fanout problems nor signal
> > degradation (see biology)
>
> CPUs already use amplifiers on long paths, but you don't want to stick
> too many of them on your signal path because you'll make the
> propagation delays much worse!

The only effective solution to this is pure access locality. Long
range accesses will be considered as iterations of local accesses.
Think of a spike made from special states AB propagating along a wire
written in yet another magical state X.
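
To make the spike picture concrete, here is a one-dimensional sketch.
The update rule is my assumption, borrowed from Wireworld-style
cellular automata: A is the spike head, B its tail, X idle wire, and
anything else is empty substrate. Each cell only looks at its immediate
neighbours, so a long-range signal really is an iteration of local
accesses.

```python
def step(cells):
    """Advance a 1D wire by one tick.

    Assumed rule: head A decays to tail B, tail B decays to wire X,
    and wire X fires (becomes A) if a direct neighbour is a head.
    """
    nxt = []
    for i, state in enumerate(cells):
        if state == "A":
            nxt.append("B")
        elif state == "B":
            nxt.append("X")
        elif state == "X":
            left = cells[i - 1] if i > 0 else "."
            right = cells[i + 1] if i + 1 < len(cells) else "."
            nxt.append("A" if "A" in (left, right) else "X")
        else:
            nxt.append(state)  # empty substrate never changes
    return "".join(nxt)


wire = "BAXXXX"
for _ in range(3):
    print(wire)
    wire = step(wire)
# prints:
# BAXXXX
# XBAXXX
# XXBAXX
```

The tail state B is what gives the spike a direction: the cell behind
the head is not plain wire, so it cannot fire again immediately, and
the signal is regenerated at every cell instead of fading with
distance.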

> But for the longer term you are right, of course. I just hope that an
> accidental off-by-one error in some grey goo doesn't cause an infinite
> loop and the consequent consumption of the whole planet by nanoware :-)

Sooner or later it will happen anyway, albeit hopefully in a
controlled process. Since industrial nano will be designed brittle,
requiring a number of essential ingredients and a very special
environment (UHV or a noble gas atmosphere), you're talking about a
deliberate weapon design.



This archive was generated by hypermail 2b29 : Sun Oct 08 2000 - 05:20:42 PDT