Annoying little error message you may find lurking in the vmcore file following a mysterious crash of your Sun Microsystems Enterprise server. This may or may not indicate an ecache problem on one of your CPUs. It may indicate the presence of evil gerbils tapdancing across your system board - who knows? The good folks at Sun will advise you to wait around for the system to crash again on the same CPU. And then they'll replace it. At my place of employ this is cause for big laughs among the Windows NT administrators, who are used to their crappy operating system blowing up on a daily basis.

This error message does in fact indicate an ecache problem on an UltraSPARC-II based server. Some little-known facts about the problem:

  • It affected all UltraSPARC-II-based systems, though it was (obviously) more likely to appear on machines with more CPUs and on CPUs with larger caches. As a result, the problem did not become widespread until a lot of Enterprise-class servers whose CPUs had 8 MB of ecache were in the field.
  • It was a "soft error", in the sense that a CPU module that had produced the error once was no more likely to produce it a second time than any other module was to produce it the first time.
  • No one inside Sun has been able to come up with a better explanation for the cause of the problem than cosmic rays. And actually, when you think about it, 8 MB of non-error-correcting SRAM is a pretty good cosmic ray detector.
  • There seemed to be a number of factors that made CPU modules more likely to have this error. In at least one case, a machine that had been experiencing the problem was found to be in a "hot spot" in the computer room, where a quirk of the HVAC system didn't provide enough circulation of cold air to keep the air exiting the machine within specification. After the machine was moved, the problems went away. In another case, moving the machine away from an elevator shaft caused the problems to stop occurring. No one is sure whether the contributing factor there was electromagnetic fields or vibration from the elevator. And, annoyingly enough, modules that had been installed in the field were more likely to exhibit the problem than modules installed in the factory. This led to situations where Sun was trying to explain to the customers that they'd be better off not replacing CPU modules that had failed and the customers were insisting on replacements.
  • The problem was more likely to occur on lightly loaded systems; this makes sense once you think about the fact that the cache entries on busy systems are invalidated and reloaded much more frequently.

The UltraSPARC-III CPUs from Sun have ECC protection on the ecache.
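
For anyone who wants to see why more CPUs, bigger caches, and lighter load all push in the same direction, here is a rough back-of-the-envelope sketch. It treats bit flips as independent random events (a simple Poisson model), which is my simplification and not anything Sun published. The per-bit upset rate and the 64-byte line size are made-up placeholder assumptions, and the function names are hypothetical; only the direction of the trends means anything, not the absolute numbers.

```python
import math

# ASSUMPTION: purely illustrative upset rate per bit per hour.
# This is NOT a measured figure for any real SRAM part.
UPSETS_PER_BIT_HOUR = 1e-15


def p_any_upset(cpus, cache_mb, hours, rate=UPSETS_PER_BIT_HOUR):
    """Chance of at least one bit flip somewhere in the total ecache.

    Assumes independent flips, so the count is Poisson with a mean
    proportional to (total bits) x (hours of exposure).
    """
    bits = cpus * cache_mb * 1024 * 1024 * 8
    expected_flips = bits * hours * rate
    # 1 - exp(-x), computed via expm1 to stay accurate for tiny x.
    return -math.expm1(-expected_flips)


def p_line_corrupt(residency_seconds, line_bytes=64,
                   rate=UPSETS_PER_BIT_HOUR):
    """Chance that one cache line picked up a flip while it sat in cache.

    ASSUMPTION: 64-byte lines. A line that is reloaded often (busy
    system) has a short residency and so little time to collect a flip.
    """
    bits = line_bytes * 8
    expected_flips = bits * (residency_seconds / 3600.0) * rate
    return -math.expm1(-expected_flips)


if __name__ == "__main__":
    # More CPUs and bigger caches mean more bits exposed.
    for cpus, mb in [(1, 1), (4, 4), (64, 8)]:
        print(cpus, "CPUs x", mb, "MB:", p_any_upset(cpus, mb, 24 * 365))
    # Longer residency (lightly loaded box) means more exposure per line.
    for seconds in [1, 60, 3600]:
        print(seconds, "s resident:", p_line_corrupt(seconds))
```

Under these assumptions the risk grows with the total number of exposed bits and with how long each line sits untouched, which lines up with the observations above about big Enterprise boxes and lightly loaded systems.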

Note: I used to work for Sun as a systems engineer, which is where I got all this information.
