This Monday, Linux kernel creator Linus Torvalds went on a lengthy rant about the lack of Error Correcting Code (ECC) RAM in consumer PCs and laptops.
… the misleading and recurring policy of “consumers do not need ECC” [made] the market for ECC memory disappear.
The arguments against ECC were always complete and utter rubbish. Now even the memory manufacturers are starting to do ECC internally because they finally owned up to the fact that they absolutely must.
If you are unfamiliar with ECC RAM, it’s probably because you are not building or specifying dedicated servers using server-class CPUs and motherboards – which unfortunately is about the only place you can actually find ECC. In a nutshell, ECC RAM includes a small amount of extra memory that is used to detect and correct errors.
Memory error and probability
In most modern implementations, that means eight extra check bits for every 64-bit word stored in RAM. A single-bit error – a 0 flipped to 1, or a 1 flipped to 0 – can be both detected and corrected automatically. Two bits flipped in the same word can be detected but not corrected. Three or more bits flipped in the same word will probably be detected, but detection is not guaranteed.
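To make the "correct one, detect two" behavior concrete, here is a minimal sketch of a textbook extended Hamming (SECDED) code over a 64-bit word. It is an illustration only: real memory controllers use their own code layouts in hardware, and the function names (`check_bits_needed`, `encode`, `decode`) are made up for this example.

```python
def check_bits_needed(data_bits):
    """Hamming check bits (smallest r with 2**r >= data_bits + r + 1)
    plus one overall parity bit for double-error detection."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def is_data_position(pos):
    # Positions that are powers of two hold Hamming check bits.
    return pos & (pos - 1) != 0


def encode(data, data_bits=64):
    """Return a list of bits: index 0 is overall parity, 1..n is the Hamming code."""
    r = check_bits_needed(data_bits) - 1
    n = data_bits + r
    code = [0] * (n + 1)
    j = 0
    for pos in range(1, n + 1):
        if is_data_position(pos):
            code[pos] = (data >> j) & 1
            j += 1
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:
                parity ^= code[pos]
        code[p] = parity  # makes the group covered by p even
    code[0] = sum(code[1:]) % 2  # makes the whole word even
    return code


def decode(code):
    """Return (data, status); data is None when the error cannot be corrected."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome <= n:
        code[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"
    data, j = 0, 0
    for pos in range(1, n + 1):
        if is_data_position(pos):
            data |= code[pos] << j
            j += 1
    return data, status


word = 0xDEADBEEFCAFEF00D
stored = encode(word)

one_flip = list(stored)
one_flip[10] ^= 1
print(decode(one_flip)[1])   # corrected

two_flips = list(stored)
two_flips[10] ^= 1
two_flips[20] ^= 1
print(decode(two_flips)[1])  # uncorrectable
```

With 64 data bits the sketch needs seven Hamming check bits plus one overall parity bit, which matches the eight extra bits per word described above. Three flipped bits can make the syndrome point at a valid position while overall parity looks like a single error, so the decoder may "correct" the wrong bit, which is why detection beyond two flips is not guaranteed.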
Bitflips can happen for many reasons, starting with cosmic radiation or simple hardware failure. A large-scale study of Google’s servers found that approximately 32 percent of all servers (and 8 percent of all DIMMs) in Google’s fleet experience at least one memory error per year. But the vast majority of these are single-bit errors – and since Google uses server CPUs and ECC RAM, this means that the machines in question simply kept on running.
In consumer machines, even these single-bit errors – which are over 40 times more likely than multi-bit errors, according to Google’s data – go undetected and can introduce system instability and data corruption.
Bitflips are not always random
Not all RAM errors are the result of hardware failure or random EMF issues. In recent years, researchers have developed increasingly practical physics-based side-channel attacks that use controlled, rapid bit flips in RAM areas an application is allowed to access in order to infer or modify data values in adjacent RAM areas it should not be able to reach.
Although ECC RAM cannot mitigate RAMBleed-style attacks that infer the values of adjacent memory, it can usually stop Rowhammer attacks, where rapidly flipping bits in one RAM area causes bits in an adjacent area to change.
Even when ECC cannot actively prevent a Rowhammer attack from affecting the system – for example, when the attack flips several bits in one word – it can at least alert the system to the problem and in most cases prevent the Rowhammer attack from doing anything other than cause downtime. (Most ECC systems are configured to halt the entire machine if an uncorrectable error is detected.)
Torvalds blames Intel
And the memory manufacturers claim that it is due to economy and lower power. And they’re lying bastards – let me once again point to Rowhammer about how these problems have existed for generations already, but these f*ckers often sold broken hardware to consumers and claimed that it was an “attack” when it was always “we cut corners.”
How many times has a Rowhammer-like bit flip just happened out of sheer bad luck on real non-attack loads? We will never know. Because Intel pushed shit to consumers.
Torvalds takes the bold stance that the lack of ECC RAM in consumer technology is Intel’s fault, owing to the company’s policy of artificial market segmentation. Intel has an interest in pushing organizations with deeper pockets toward its more expensive – and more profitable – server-grade CPUs rather than letting them make effective use of the necessarily lower-margin consumer parts.
Removing support for ECC RAM from CPUs that are not directly targeted at the server world is one of the ways Intel has kept these markets highly segmented. Torvalds’ argument here is that Intel’s refusal to support ECC RAM in its consumer-targeted parts – along with its de facto near-monopoly in that space – is the real reason why ECC is almost unavailable outside the server space.
The usual argument for why ECC is not present in consumer technology is about cost, but we suspect that Torvalds is right on this point. Despite the fact that ECC RAM is essentially a hard-to-find specialty part, it usually costs only around 20 percent more per DIMM than non-ECC RAM does at retail. The real problem is that without motherboards and CPUs that support it, it will not do you any good.