|
It's about 20 years since I last heard about alpha particles
affecting memory. In the late 1970s, as chip makers prepared to solve the
problems of designing the then state-of-the-art 64K (65,536-bit) DRAM,
researchers at Intel Corp revealed that naturally occurring alpha particle
radiation had enough energy to flip a bit in the new design geometry. Alpha
particles occur naturally in air, and one source is granite rock. That's why if
you live in a rocky area, you are advised to ensure that your ground floor
rooms get plenty of ventilation, because the radioactive particles can be
carried up through the floor. Anyway, a small amount of naturally occurring
radiation exists in most places, but unlike high-energy gamma rays, alpha
particle radiation is relatively easy to shield. So semiconductor companies back
in the 1970s were disturbed to find that their prototype RAM chips were being
affected by this radiation.
There were all kinds of scares at the time
about what this would mean for the computer industry, and one suggested
workaround was error-correcting code (ECC) memory, in which redundant logic
designed onto memory boards could detect and fix single-bit errors on the
fly, and detect most double-bit errors.
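For readers who have not met ECC before, here is a minimal sketch of the idea in Python. It uses a toy Hamming code with one extra overall parity bit, protecting just 4 data bits with 4 check bits. The function names `encode` and `decode` are my own for illustration, and real memory boards use wider codes implemented in hardware (a common arrangement is 8 check bits for every 64 data bits), so treat this as a picture of the principle rather than any vendor's design.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (a list of 0/1 values).

    Index 0 holds an overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits: 0 for a valid codeword,
    # otherwise the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: assume one flipped bit. A syndrome of 0 here means
        # the overall parity bit itself was the one hit.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but a non-zero syndrome: two bits were flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1010)
word[6] ^= 1                 # simulate an alpha particle flipping one bit
print(decode(word))          # the original nibble comes back, flagged as corrected
```

The price is the extra check bits and the logic needed to compute the syndrome on every read, which is where the delay penalty of ECC comes from.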
...Anyway, eventually chip
makers found that the material which emitted alpha particles was actually
occurring as a low-level contaminant in the material which they used to coat and
protect the chips. Changing those materials solved the problem, as far as the
electronics world was concerned, for about 20 years.
Sun's temporary fix
to the problem in their cache was to use mirrored SRAM (ECC was not an option
because the logic delay penalty cancels out the speed advantage of using SRAM in
the cache). The real problem is probably a materials or process problem in the
semiconductor manufacturing chain. Once identified, these problems can usually
be fixed quite easily.
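To show why mirroring can be cheaper in delay than ECC, here is a rough Python sketch. I am assuming that each copy of a word carries a simple parity bit so that the reader can tell which copy to trust; that detail, and the names `MirroredStore` and `parity`, are my own illustration and not a description of Sun's actual circuit.

```python
def parity(word):
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


class MirroredStore:
    """Toy model of mirrored memory: every word is kept in two banks."""

    def __init__(self, size):
        # Each entry is (word, parity bit stored alongside it).
        self.a = [(0, 0)] * size
        self.b = [(0, 0)] * size

    def write(self, addr, word):
        entry = (word, parity(word))
        self.a[addr] = entry
        self.b[addr] = entry

    def read(self, addr):
        # Try bank A first, then bank B. A single parity check is all
        # that sits in the read path when the first copy is good.
        for primary, other in ((self.a, self.b), (self.b, self.a)):
            word, p = primary[addr]
            if parity(word) == p:
                if other[addr] != (word, p):
                    other[addr] = (word, p)   # repair the damaged copy
                return word
        raise RuntimeError("both copies failed their parity check")


store = MirroredStore(4)
store.write(2, 0b1011)
word, p = store.a[2]
store.a[2] = (word ^ 0b0100, p)   # simulate a random bit flip in bank A
print(bin(store.read(2)))         # the good copy in bank B is returned
```

The trade-off is doubling the amount of SRAM in exchange for keeping the check on the read path very short.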
This problem affected thousands of users, and
one Sun customer wrote to tell us their company had replaced over 1000
UltraSPARC II 400 MHz CPUs because of the ecache issue.
Could this
have been avoided?
About two decades ago, computer companies spent
more time testing and qualifying the new components they used in new systems,
and these kinds of problems would rarely have reached customers back in the
1980s. However, increased competition has led to shorter delays between new
technology becoming available and being shipped in volume in user systems.
Manufacturers now rely much more on computer simulations to get their basic chip
integration tested, and don't spend so much time doing physical testing of their
new systems. In the 1990s, for example, Intel shipped millions of flawed Pentium
chips with a floating-point division bug. However, unlike Sun's cache problem
(which caused random faults), the Intel problem operated consistently and had a
software workaround.
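The consistency of the Pentium fault is what made it easy to test for. The Python sketch below uses the pair of operands that was widely circulated at the time as a check. The function name `fdiv_residual` is my own. On a correct floating-point unit the residual is zero, while flawed Pentiums were reported to return 256 for this calculation.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Divide x by y, multiply back, and return what is left over.

    A correct double-precision divider gives exactly 0.0 for these
    operands; a wrong quotient shows up as a non-zero residual.
    """
    return x - (x / y) * y


if fdiv_residual() == 0.0:
    print("division looks correct for this operand pair")
else:
    print("division error detected, residual =", fdiv_residual())
```

A deterministic fault like this gives the same wrong answer every time, so a one-line check can find it. A random bit flip in a cache offers no such test.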
Sad to say, we are going to see more of these
kinds of problems occurring in future systems from all vendors. My guess is that
Sun is now going to be ultra cautious about testing the new products it uses,
and that may account for some of the delays in getting new generations of faster
SPARC systems to market. But if they had employed a few more electronic
engineers with gray hairs in the design department, the classic symptoms might
have been identified a lot sooner and a lot of customers could have been spared
sleepless nights. |