February 2011
Is Sandy Bridge a Bridge too far?
By Rick Smith
rants@vcmail.net
For those of you who are library buffs, "A Bridge Too Far" was a book written in 1974 by Cornelius Ryan and turned into an epic 1977 war film produced by Joseph E. Levine and directed by Richard Attenborough. The name comes from an unconfirmed comment made by British Lt. Gen. Frederick Browning, deputy commander of the First Allied Airborne Army, who told Field Marshal Bernard Montgomery, the operation's architect, before it began, "I think we may be going a bridge too far."
This has now turned into a cliché for "biting off more than you can chew." It seems that you could say this about Intel Corp. these days, but is that what really happened? I'll try to explain in my own inimitable style. Intel just released its new series of processors and motherboards. The processors, code-named Sandy Bridge, are an improved version of its i3, i5 and i7 line with new features such as better embedded graphics and unlocked cores for easy overclocking. I will save analysis of that for another day. The motherboards built for them use Intel's new 6-series chipsets. They feature the new SATA 3.0 at 6 Gbps and USB 3.0. They also, of course, changed the socket to LGA1155, which means new CPUs for all. Now Intel has announced that it has identified a bug in the 6-series chipset's SATA controller.
Intel states that "In some cases, the Serial-ATA (SATA) ports within the chipsets may degrade over time, potentially impacting the performance or functionality of SATA linked devices such as hard disk drives and DVD-drives." This is a hardware issue, so it cannot be fixed by a patch or firmware update. If you already have one of these boards, it needs to be replaced. The problem is confined to the SATA II ports.
The chipset provides four SATA II ports, which are the ones at risk, and two SATA III ports, which remain unaffected by the problem. Intel has already ordered a complete recall of the affected boards and will begin shipping the fixed version of the chipset in late February. The recall will reduce Intel's revenue by around $300 million and cost around $700 million to completely repair and replace affected systems. Ouch! Here's how it all came about.
Intel has been testing its 6-series chipset for months now. The chipset passed all of its internal qualification tests as well as all of the OEM qualification tests. These are the same tests that all Intel chipsets must go through, testing things like functionality, reliability and behavior under various conditions (high temps, low temps, high voltage, low voltage, etc.). The chipset made it through all of these tests just fine.
There are two general types of problems you run into in chip manufacturing. The first is an engineering oversight: a functional problem that will cause a failure during your validation tests. The second type of problem is more annoying: a bug of a statistical nature. In these situations, the problem doesn't appear on every chip in every situation, but only on n chips out of every x. When a bug doesn't present itself in small quantities, it's very difficult to track down. This is the nature of the 6-series chipset bug, and it's also why the problem didn't appear sooner. Intel mentioned that after it had built over 100,000 chipsets it started to get some complaints from its customers about failures. Recently Intel duplicated and confirmed the failure. Intel then decided to recall all boards and halted production of its 6-series chipsets.
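For the number-minded, here is a rough sketch of why a statistical bug can sail through testing and still show up once boards ship in volume. The defect rate of 1 in 10,000 is purely a made-up figure for illustration, not anything Intel has published, and the function name is my own.

```python
# Chance of catching a rare defect as a function of how many chips you look at.
# HYPOTHETICAL_DEFECT_RATE is an assumed figure for illustration only.

HYPOTHETICAL_DEFECT_RATE = 1 / 10_000


def chance_of_seeing_defect(defect_rate: float, sample_size: int) -> float:
    """Probability that at least one chip in the sample shows the defect."""
    return 1.0 - (1.0 - defect_rate) ** sample_size


for chips in (100, 1_000, 10_000, 100_000):
    chance = chance_of_seeing_defect(HYPOTHETICAL_DEFECT_RATE, chips)
    print(f"{chips:>7,} chips: {chance:.1%} chance of at least one failure")
```

With a rate that low, a test lab looking at a few hundred chips will almost never see the problem, while a production run in the six figures almost certainly will.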
Recalls are costly, and they cost you sales if you don't have replacement products readily available. Then, of course, you pay a price for handling the recall itself. Just go ask Toyota, if you know what I mean. It's estimated that Intel will lose a billion dollars due to this. That's the big B, folks. And you thought only Congress threw that kind of money around. I would have loved to be a fly on the wall in the engineering department when they figured this thing out and said "Oh S#^@, we're in for a lot of trouble!" So what is it that really happens, in layman's terms? Intel says you'd see an increase in bit error rates on a SATA link over time. Transfers will retry if there is an error, but eventually, if the error rate is high enough, you'll see reduced performance as the controller spends more time retrying than it does sending actual data.
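To put rough numbers on that retry effect, here is a minimal sketch of a link that resends any frame that arrives with an error. The 300 MB/s figure is the usual usable rate of a SATA II link. The frame size and the bit error rates are assumptions I picked for illustration, not Intel measurements, and the function names are my own.

```python
# Rough model of a SATA link that resends any frame that arrives with an error.
# Frame size and bit error rates are illustrative assumptions.

NOMINAL_MB_PER_S = 300.0   # usable rate of a SATA II (3 Gbps) link
FRAME_BITS = 8192 * 8      # assumed frame size: 8 KB of data per frame


def frame_error_probability(bit_error_rate: float,
                            frame_bits: int = FRAME_BITS) -> float:
    """Chance that at least one bit in a frame is corrupted."""
    return 1.0 - (1.0 - bit_error_rate) ** frame_bits


def effective_throughput(bit_error_rate: float,
                         nominal: float = NOMINAL_MB_PER_S,
                         frame_bits: int = FRAME_BITS) -> float:
    """Useful data rate when every bad frame is resent until it gets through."""
    p_bad = frame_error_probability(bit_error_rate, frame_bits)
    # On average a frame needs 1 / (1 - p_bad) attempts, so only the
    # fraction (1 - p_bad) of the link's time carries data that sticks.
    return nominal * (1.0 - p_bad)


for ber in (1e-12, 1e-8, 1e-6, 1e-5, 1e-4):
    p_bad = frame_error_probability(ber)
    print(f"bit error rate {ber:.0e}: {p_bad:6.1%} of transmissions fail, "
          f"about {effective_throughput(ber):5.1f} MB/s of useful data")
```

Once the share of failed transmissions passes one half, the link is spending more time on retries than on real data, which is the slowdown Intel describes.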
Ultimately you could see a full disconnect: your SATA drive(s) would no longer be visible at POST, or you'd see a drive letter disappear in Windows. Intel said that it hasn't been made aware of a single failure seen by end users. Intel expects that over three years of use it would see a failure rate of approximately 5 percent to 15 percent, depending on usage model. Remember, this problem isn't a functional issue but rather one of those nasty statistical issues, so by nature it should take time to show up in large numbers, although there should still be some very isolated incidents of failure early on.
Needless to say, Intel wasted no time getting the bad news out to its vendors, such as yours truly. I had just cleared out my entire inventory and was ready to start stocking the new product, so fortunately I had none of the affected product in stock. So I'll keep selling what I've been selling (as long as there is product available). Oh, well. At least I could always get a job at Intel thinking up new names for its products. I mean, really, doesn't Pentium sound like toothpaste? And all of Intel's previous code names for projects sound like Civil War battlefields: Nehalem, Bloomfield, Lynnfield, Clarkdale, and the list goes on. So, how's about this instead? Jackknife, Upendbend, Soggybottom, Rumpelstiltskin, Emptynest and so on. So, did Intel overextend itself by putting something on the market too soon? You be the judge. I think it's good to know nobody's "too big to fail."

