The process of manufacturing silicon chips is shrouded in high-tech equipment, confusing buzzwords and the sort of alien environments that necessitate such ridiculous attire as that lampooned in Intel Pentium II and Pentium !!! advertisements. But behind the mythos and the bunny suits, chip manufacturing is still a real-world process, subject to real-world compromises and practicalities. Wafer yield is perhaps the most important factor in the profitability of chip manufacturing.
Inside the Chip Shop
Chips are fabricated on large wafers of silicon, with many chips on each wafer. In simple terms, the wafer yield is the average number of functional chips produced per wafer, often expressed as a percentage of the number of chips laid out on each wafer.
Not every chip printed on every wafer will be functional. Imperfections and impurities in the silicon or other materials, or specks of dust and other debris that get past the clean room procedures and air filters, will create localised defects. Such a defect can render that part of the chip useless, or allow it to function but more slowly or with higher power loss than intended.
Those chips which fail entirely often end up as commemorative souvenir keyrings for chip company employees.
Of course, since these problems are localised, a single such defect will affect only a single chip; the rest of the wafer may yield perfect chips. The chance of a critical defect appearing in any particular chip is approximately proportional to the area of the chip, so complex chips like x86 compatibles, with their large die areas, tend to have poorer yields than smaller chips like PowerPCs.
Features of a chip design can have an impact on whether a defect is likely to be critical. The most fundamental is the feature size of the silicon process: the larger the transistors and interconnects on the chip, the larger the defects that they can tolerate without creating a broken connection or a short circuit. Redundancy and fault tolerance are design features that are incorporated specifically to improve yield.
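To put rough numbers on the link between die area and yield, here is a minimal sketch using the simple Poisson yield model, in which the fraction of defect-free dies is exp(-area x defect density). The model choice, the function name and the defect density figure are all assumptions made for illustration; real fabs use more elaborate models and their actual defect densities are not given here.

```python
import math

def poisson_yield(die_area_cm2: float, defect_density_per_cm2: float) -> float:
    """Fraction of dies expected to carry zero critical defects.

    Assumes critical defects land independently and uniformly on the
    wafer (the simple Poisson model), which is an idealisation.
    """
    return math.exp(-die_area_cm2 * defect_density_per_cm2)

# Hypothetical figure, chosen only to make the trend visible.
DEFECT_DENSITY = 0.5  # critical defects per square centimetre (assumed)

for area in (0.5, 1.0, 2.0):
    y = poisson_yield(area, DEFECT_DENSITY)
    print(f"{area:.1f} cm^2 die: {y:.1%} yield")
# 0.5 cm^2 die: 77.9% yield
# 1.0 cm^2 die: 60.7% yield
# 2.0 cm^2 die: 36.8% yield
```

For small dies the chance of failure, 1 - exp(-area x density), is close to area x density, which is the "approximately proportional" relationship described above. As the die grows, the yield falls away quickly.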
A good example of a chip component which can easily support redundancy is a memory array. Since memories (caches in processors, buffers in peripheral interfaces and network chips) tend to account for large amounts of the surface area of chips, there's a high probability of a critical defect appearing in a memory array.
Fortunately, since memories have a uniform and regular structure, it's easy to provide some redundancy: provide more memory than is needed, and if part of that memory is defective, use a 'spare' chunk of memory to replace it. This is typically done while the chip is being tested, once the faulty areas have been identified.
If the probability of a critical defect occurring in a memory array is small, then the probability of two critical defects occurring is very small; hence the majority of chips that would be lost due to memory defects can be recovered, at the cost of a little extra die area in each chip. Like all practicalities, it's a trade-off.
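The "small, then very small" argument can be checked with a few lines of arithmetic. This sketch assumes the number of critical defects in a memory array follows a Poisson distribution, and that each spare block can repair exactly one defect and is never itself defective. Those are simplifying assumptions of mine, and the function name and the 0.1 figure are hypothetical.

```python
import math

def memory_yield(mean_defects: float, spares: int) -> float:
    """Probability that the array has no more defects than spare blocks.

    Sums the Poisson probabilities of seeing 0, 1, ..., `spares` defects.
    """
    return sum(
        math.exp(-mean_defects) * mean_defects**k / math.factorial(k)
        for k in range(spares + 1)
    )

MEAN_DEFECTS = 0.1  # average critical defects per memory array (assumed)

print(f"no spares: {memory_yield(MEAN_DEFECTS, 0):.1%}")
print(f"one spare: {memory_yield(MEAN_DEFECTS, 1):.1%}")
# no spares: 90.5%
# one spare: 99.5%
```

Under these assumptions a single spare block recovers almost all of the chips that would otherwise be lost to a memory defect, which is why the extra die area is usually considered worth paying for.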
Perhaps the best example of a design with some degree of fault tolerance comes to us from Intel. The basic idea is that if a part of the chip is defective, it may be possible simply to do without it. This is what Intel did with their 486SX and 486DX product lines. The 'difference' between the two units was that the SX had no hardware floating point support, whereas the DX had a built-in FPU. Or that's what they told the customers.
In actual fact, the two chips were cut from the same die, at least in the early production runs. The FPU was a large unit, with critical timing and delicate construction (for best performance). As such, it was significantly more susceptible to defects than the rest of the chip. To discard an entire chip because of a fault in the FPU would lead to very poor yields. Instead, the chip was designed so that the FPU could be disabled if it was found to be defective during testing, and the chip classified as an SX part and sold at a lower price than a DX part. In this way, the yield cost of the FPU was almost completely removed from the 486 design budget.
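The effect of this kind of salvage on sellable output can be sketched with the same Poisson assumption, treating the FPU and the rest of the chip as two regions that pick up defects independently. The function name, the bin labels and the defect figures below are hypothetical and are not taken from any Intel data.

```python
import math

def bin_fractions(core_mean_defects: float, fpu_mean_defects: float) -> dict:
    """Split dies into bins, assuming independent Poisson defects per region."""
    core_ok = math.exp(-core_mean_defects)  # rest of the chip is defect-free
    fpu_ok = math.exp(-fpu_mean_defects)    # FPU is defect-free
    return {
        "full_part": core_ok * fpu_ok,               # everything works
        "fpu_disabled_part": core_ok * (1 - fpu_ok), # salvaged, FPU switched off
        "scrap": 1 - core_ok,                        # fault outside the FPU
    }

# Assumed figures: the FPU region is given the higher defect rate.
bins = bin_fractions(core_mean_defects=0.2, fpu_mean_defects=0.3)
for name, fraction in bins.items():
    print(f"{name}: {fraction:.1%}")

sellable = bins["full_part"] + bins["fpu_disabled_part"]
print(f"sellable with salvage: {sellable:.1%}")
```

Without salvage, only the "full_part" fraction could be sold. With it, the sellable fraction depends only on the defect rate of the rest of the chip, which is the sense in which the yield cost of the FPU drops out of the budget.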
A closely related topic is that of speed grading. Minor defects will affect the speed at which a chip can run; a very pure wafer, on a good day with a low pollen count, can produce a 'best of breed' chip, the cream of the yield, whereas the same design printed badly on a less pure wafer will run significantly more slowly.
The chips are speed graded during testing; that is, for each individual chip, the best speed at which it will reliably run (inside the design parameter envelope of interference and heat dissipation) is determined, and that's the speed it's sold as. Pentium 166 MMX and 200 MMX, for example, were the same design; the 166MHz parts were just those that failed to make the grade at a faster speed.
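In code, speed grading amounts to picking the fastest advertised grade that a chip's measured speed can support. The sketch below is only an illustration of that rule: the function name and the test measurements are made up, and the two grades are simply the 166 and 200 MHz parts mentioned above.

```python
from typing import Optional, Sequence

def speed_grade(max_stable_mhz: float, grades_mhz: Sequence[int]) -> Optional[int]:
    """Return the fastest grade the chip can reliably run at, or None.

    `max_stable_mhz` is the highest speed at which the chip passed
    testing inside its heat and interference limits (hypothetical input).
    """
    passing = [g for g in grades_mhz if g <= max_stable_mhz]
    return max(passing) if passing else None

GRADES = (166, 200)  # grades sold, in MHz

for measured in (150.0, 185.0, 212.0):
    print(measured, "->", speed_grade(measured, GRADES))
# 150.0 -> None
# 185.0 -> 166
# 212.0 -> 200
```

Note that the chip measured at 185 MHz is sold as a 166 MHz part: any headroom between the measured speed and the grade it is sold at goes unused under normal operating conditions.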
This is why overclocking happens to work. The manufacturers mark a part as low speed because it won't run faster without consuming too much power and overheating. The overclocking crowd can pick these parts up cheaply and use additional cooling equipment to allow the chip to operate beyond its thermal spec, and therefore run quicker than it would under testing conditions.