
SPARC Product Directory

from the makers of StorageSearch.com

Looking Back at Sun's Cache Memory Problem

by Zsolt Kerekes, editor - December 3, 2001

see also:- fault tolerant SSDs, rethinking RAM architecture, adaptive ECC for flash
Throughout most of this year Sun's reputation for hardware reliability was plagued by random faults in some of its cache memory products. These problems caused a loss of confidence in Sun's core competence as a vendor of reliable, trouble-free servers. A recent Computerworld interview with Sun's CEO Scott McNealy seems to draw a line under the problem, which Sun now blames on process or design problems in the high speed SRAM chips it was buying from IBM to use in its cache. Sun claims that data bits in the IBM supplied SRAM were being randomly flipped by alpha particles.

It's about 20 years since I last heard about alpha particles affecting memory. In the late 1970's, as chip makers prepared to solve the problems of designing the then state of the art 64K (65,536 bit) DRAM, researchers at Intel Corp revealed that naturally occurring alpha particle radiation had enough energy to flip a bit in the new design geometry. Alpha particles occur naturally in air, and one source is granite rock. That's why, if you live in a rocky area, you are advised to ensure that your ground floor rooms get plenty of ventilation, because the radioactive particles can be carried up through the floor. Anyway, a small amount of naturally occurring radiation exists in most places but, unlike high energy gamma rays, alpha particle radiation is relatively easy to shield. So semiconductor companies back in the 1970's were disturbed to find that their prototype RAM chips were being affected by this radiation.

There were all kinds of scares at the time about what this would mean for the computer industry, and one suggested workaround was error correcting code (ECC) memory, in which redundant logic designed onto the memory boards could detect and fix single bit errors on the fly, and detect most double bit errors.
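To make the ECC idea concrete, here is a minimal sketch of a single error correcting, double error detecting (SECDED) code in Python. It is an illustration only, assuming a textbook extended Hamming code that protects 4 data bits with 4 check bits. The function names `encode` and `decode` are invented for this example, and real memory boards apply the same principle to much wider words in hardware.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; check bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit so the whole word has even parity.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double_error'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits in 1..7.
    # It is zero for a valid codeword and equals the position of a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 1 means an odd number of bits have flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single bit error: flip it back (syndrome 0 means the parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "double_error"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one bit of `encode(n)` and passing the result to `decode` returns the original value with status `corrected`, while flipping any two bits is reported as `double_error` rather than being silently miscorrected.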

...Anyway, eventually chip makers found that the material which emitted alpha particles was actually occurring as a low level contaminant in the material which they used to coat and protect the chips. Changing those materials solved the problem, as far as the electronics world was concerned, for about 20 years.

Sun's temporary fix to the problem in their cache was to use mirrored SRAM (ECC was not an option because the logic delay penalty cancels out the speed advantage of using SRAM in the cache). The real problem is probably a materials or process problem in the semiconductor manufacturing chain. Once identified, these problems can usually be fixed quite easily.
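As a rough illustration of how mirroring can mask a flipped bit, here is a minimal Python sketch. It is not a description of Sun's actual design: the per-word parity bit used to decide which copy to trust is an assumption made for this example, and the class `MirroredCache` and its methods are hypothetical names.

```python
def parity(word):
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


class MirroredCache:
    """Toy model: every word is stored twice, each copy with a parity bit."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}

    def write(self, addr, word):
        entry = (word, parity(word))
        self.primary[addr] = entry
        self.mirror[addr] = entry

    def read(self, addr):
        # Try the primary copy first, then fall back to the mirror.
        for copy in (self.primary, self.mirror):
            word, stored_parity = copy[addr]
            if parity(word) == stored_parity:
                # Rewrite both copies from the good one to repair the bad one.
                self.primary[addr] = self.mirror[addr] = (word, stored_parity)
                return word
        raise RuntimeError("both copies failed the parity check")

    def inject_fault(self, copy_name, addr, bit):
        """Simulate a particle strike by flipping one bit in one copy."""
        copy = getattr(self, copy_name)
        word, stored_parity = copy[addr]
        copy[addr] = (word ^ (1 << bit), stored_parity)
```

After `cache.write(0x10, 0xBEEF)` and `cache.inject_fault("primary", 0x10, 3)`, a call to `cache.read(0x10)` still returns `0xBEEF`, because the damaged primary copy fails its parity check and the mirror is used instead. A single parity bit only catches an odd number of flipped bits, so this scheme trades capacity for speed rather than offering the coverage of full ECC.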

This problem affected thousands of users, and one Sun customer wrote to tell us their company had replaced over 1000 UltraSPARC 2 400MHz CPUs because of the ecache issue.

Could this have been avoided?

About two decades ago, computer companies spent more time testing and qualifying the components they used in new systems, and these kinds of problems would rarely have reached customers. However, increased competition has shortened the delay between a new technology becoming available and its being shipped in volume in user systems. Manufacturers now rely much more on computer simulations to test their basic chip integration, and don't spend so much time doing physical testing of their new systems. In the 1990's, for example, Intel shipped millions of flawed Pentium chips with a floating point division bug. However, unlike Sun's cache problem (which caused random faults), the Intel problem operated consistently and had a software workaround.

Sad to say, we are going to see more of these kinds of problems occurring in future systems from all vendors. My guess is that Sun is now going to be ultra cautious about testing the new products it uses, and that may account for some of the delays in getting new generations of faster SPARC systems to market. But if they had employed a few more electronic engineers with gray hairs in the design department, the classic symptoms might have been identified a lot sooner and a lot of customers could have been spared sleepless nights.


...Later...

In October 2003 - I cited this problem as one of the factors which led to Sun's downfall in my article:- Are Sun's Days Numbered?

In January 2002 - an in depth analysis of Sun's cache memory problem was published in this article:- Unsafe At Any Speed?

Cypress and Avnet Cilicon Net Seminar Tackles Soft Errors in Memory Devices

SAN JOSE, Calif. - November 4, 2003 - Cypress Semiconductor and semiconductor distributor Avnet Cilicon today launched "A Hard Look at Soft Errors," a free "view-on-demand" net seminar addressing bit errors in memory device manufacturing.

"Soft errors" are caused by the bombardment of alpha particles and cosmic rays during manufacturing, and are an increasing source of device failures as process linewidths shrink below 0.15 µm. This seminar explores the causes of soft errors and presents some steps being taken by memory manufacturers to counteract soft errors and minimize their effects. Participants are introduced to the concept of "Soft Error Rate," the metric used to quantify device susceptibility to these errors, and process and design improvement techniques are considered. The presentation also includes a comparison of soft error rates for 90-nm and 130-nm processes. It is presented by Ritesh Mastipuram, an applications engineer in Cypress's Memory Products Division, who has been involved with Cypress's Soft Error Task Force, new product definition and system analysis for the past two years.

SPARC(R) is a registered trademark of SPARC International, Inc. SPARC PRODUCT DIRECTORY(SM) is a service mark of SPARC International, Inc used under license by ACSL. Products using the SPARC trademarks are based on an architecture developed by Sun Microsystems, Inc.