by Zsolt Kerekes, editor - December 3, 2001
see also: storage reliability
Throughout most of this
year Sun's reputation for hardware reliability was plagued by random faults in
some of their cache memory products. These problems caused a loss of confidence
in Sun's core competence as a vendor of reliable, trouble-free servers. A recent
Computerworld
article with Sun's CEO Scott McNealy seems to draw a line under the problem,
which Sun now blames on process or design problems in the high speed SRAM chips
it was buying from IBM to use in its cache. Sun claims that data bits in the
IBM-supplied SRAM were being randomly flipped by alpha particles.
It's about 20 years since I last heard about alpha particles
affecting memory. In the late 1970's, as chip makers prepared to solve the
problems of designing the then state of the art 64K (65,536 bit) DRAM,
researchers at Intel Corp revealed that naturally occurring alpha particle
radiation had enough energy to flip a bit in the new design geometry. Alpha
particles occur naturally in air, and one source is the radon gas released by
granite rock. That's why, if you live in a rocky area, you are advised to ensure
that your ground floor rooms get plenty of ventilation, because the radioactive
gas can be carried up through the floor. Anyway, a small amount of naturally
occurring radiation exists in most places, but unlike high energy gamma rays,
alpha particle radiation is relatively easy to shield. So semiconductor
companies back in the 1970's were disturbed to find that their prototype RAM
chips were being affected by this radiation.
There were all kinds of scares at the time
about what this would mean for the computer industry, and one suggested
workaround was error correcting code (ECC) memory, in which redundant logic
designed onto the memory boards could detect and fix single bit errors on the
fly, and detect most double bit errors.
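To make the ECC idea concrete, here is a minimal sketch of a single error correcting, double error detecting (SECDED) code: an extended Hamming code that protects 4 data bits with 4 check bits. It is a toy illustration only. Real ECC memory does this in hardware over much wider words, and the function names `encode` and `decode` are my own, not taken from any product.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8                      # bits[0] = overall parity, bits[1..7] = Hamming(7,4)
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # check bit covering positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # check bit covering positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # check bit covering positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) % 2             # makes the parity of the whole word even
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    odd_parity = sum(bits) % 2              # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        bits[syndrome] ^= 1                 # single flip: syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"        # two flips: detected, cannot be fixed
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status


if __name__ == "__main__":
    codeword = encode(0b1011)
    print(decode(codeword))                     # status is "ok"
    print(decode(codeword ^ 0b00100000))        # one flipped bit: status is "corrected"
    print(decode(codeword ^ 0b00100100))        # two flipped bits: status is "uncorrectable"
```

The price of the protection is visible even in this toy: every read has to compute a syndrome before the data can be trusted, and that extra logic sits in the read path.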
...Anyway, eventually chip
makers found that the material which emitted alpha particles was actually
occurring as a low level contaminant in the material which they used to coat and
protect the chips. Changing those materials solved the problem as far as the
electronics world was concerned for about 20 years.
Sun's temporary fix
to the problem in their cache was to use mirrored SRAM (ECC was not an option
because the logic delay penalty cancels out the speed advantage of using SRAM in
the cache). The real problem is probably a materials or process problem in the
semiconductor manufacturing chain. Once identified, these problems can usually
be fixed quite easily.
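The sources I have seen do not spell out how the mirrored cache decides which copy to trust. One plausible scheme, and it is only my assumption, is to store a parity bit alongside each copy and fall back to the mirror when the primary fails its parity check. A parity check is a much shorter logic path than a full ECC correction. The sketch below models that idea in software; the class name `MirroredCache` and its methods are hypothetical.

```python
def parity(word: int) -> int:
    """Even-parity bit for a word: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2


class MirroredCache:
    """Toy model of a mirrored store with one parity bit per copy (an assumed design)."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}

    def write(self, addr: int, word: int) -> None:
        entry = (word, parity(word))
        self.primary[addr] = entry          # every write goes to both copies
        self.mirror[addr] = entry

    def read(self, addr: int) -> int:
        word, p = self.primary[addr]
        if parity(word) == p:
            return word                     # fast path: primary copy is clean
        word, p = self.mirror[addr]
        if parity(word) == p:
            self.primary[addr] = (word, p)  # repair the damaged primary copy
            return word
        raise RuntimeError("both copies failed their parity check")
```

Note that a single parity bit only catches an odd number of flipped bits in a copy, so this scheme trades some coverage for speed and doubles the amount of SRAM needed.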
This problem affected thousands of users, and
one Sun customer wrote to tell us their company had replaced over 1,000
UltraSPARC 2 400MHz CPUs because of the ecache issue.
Could this
have been avoided?
About two decades ago, computer companies spent
more time testing and qualifying the new components they used in new systems,
and these kinds of problems would rarely have reached customers back in the
1980's. However, increased competition has shortened the delay between new
technology becoming available and being shipped in volume in user systems.
Manufacturers now rely much more on computer simulations to get their basic chip
integration tested, and don't spend so much time doing physical testing of their
new systems. In the 1990's, for example, Intel shipped millions of flawed Pentium
chips with a floating point division bug. However, unlike Sun's cache problem
(which caused random faults), the Intel problem was reproducible and had a
software workaround.
Sad to say, we are going to see more of these
kinds of problems occurring in future systems from all vendors. My guess is that
Sun is now going to be ultra-cautious about testing the new components it uses,
and that may account for some of the delays in getting new generations of faster
SPARC systems to market. But if they had employed a few more electronic
engineers with gray hairs in the design department, the classic symptoms might
have been identified a lot sooner and a lot of customers could have been spared
sleepless nights.
...Later...
in October 2003 - I cited this problem as one of the factors which led to Sun's
downfall in my article:- Are Sun's Days Numbered?
in January 2002 - an in-depth analysis of Sun's cache memory problem was
published in this article:- Unsafe At Any Speed?
Cypress and Avnet Cilicon Net Seminar Tackles Soft Errors in Memory Devices
SAN JOSE, Calif. - November 4, 2003 - Cypress Semiconductor and
semiconductor distributor Avnet Cilicon today launched "A Hard
Look at Soft Errors," a free "view-on-demand" net seminar
addressing bit errors in memory device manufacturing.
"Soft errors" are caused by the bombardment of alpha particles and cosmic rays
during manufacturing, and are an increasing source of device failures as process
linewidths shrink below 0.15 µm. This seminar explores the causes of soft errors
and presents some steps being taken by memory manufacturers to counteract soft
errors and minimize their effects. Participants are introduced to the concept of
"Soft Error Rate," the metric used to quantify device susceptibility
to these errors, and process and design improvement techniques are considered.
The presentation also includes a comparison of soft error rates for 90-nm and
130-nm processes. It is presented by Ritesh Mastipuram, an applications engineer
in Cypress's Memory Products Division, who has been involved with Cypress's Soft
Error Task Force, new product definition and system analysis for the past two
years.