A component in a computer system which will, if it fails, stop the whole system from working. Building high-availability clusters mostly consists of making sure you have no single point of failure (SPOF), though sometimes one will be tolerated, e.g. a single passive backplane for a system, on the assumption that it is so unlikely to fail that it is a non-problem.

A system with no SPOFs can still be vulnerable to catastrophic failure.

What are single points of failure? Consider a computer with one hard disk. If that hard disk fails, you lose all your data. That's a SPOF. Let's put two disks in your computer and mirror them (i.e. both disks contain the same data). Now, if one disk fails, you won't lose your data. There you go, your system is already more stable. However, what about your disk controller? If you only have one, then you've got another SPOF. So now you have to add a second disk controller. What about your power supply? You'll need multiple power supplies if you ever want to get close to achieving zero downtime. What about your network connection? Better have multiple network cards in your server, and make sure that the cards are connected to separate subnets so that a router failure on one won't take out both cards. And so on.
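The walk-through above amounts to listing every role in the machine and asking how many units fill it. A minimal sketch of that check follows; the `ComponentGroup` type, the function name and the example inventory are all made up for illustration, and "redundant" here simply means "more than one unit", which ignores shared dependencies between the units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentGroup:
    """A role in the system and how many interchangeable units fill it (hypothetical type)."""
    role: str
    units: int


def single_points_of_failure(groups):
    """Return the roles that are served by at most one unit."""
    return [g.role for g in groups if g.units <= 1]


# Hypothetical inventory: disks are already mirrored, the rest is not doubled up yet.
server = [
    ComponentGroup("disk", 2),
    ComponentGroup("disk controller", 1),
    ComponentGroup("power supply", 1),
    ComponentGroup("network card", 2),
]

spofs = single_points_of_failure(server)
print(spofs)
```

With this inventory the disk controller and the power supply are the roles that get flagged, matching the order in which the text discovers them.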

Single points of failure are basically any parts of a system whose failure would cause you to experience downtime and/or data loss.

As you can see, zero downtime is very difficult (and expensive) to achieve. I work in an environment where zero downtime is the goal. Our servers have multiple mirrored disks, multiple disk controllers, multiple power supplies, multiple network cards and multiple network routes; the buildings have multiple network connections entering from opposite ends; and to top it all off, we have fail-over systems so that if one of the main systems dies, all network traffic is automatically switched to the fail-over system with no impact on the client.
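To see roughly why doubling everything up helps, here is a small sketch using the standard availability arithmetic: parts that must all work multiply their availabilities, while a group of n redundant units only fails if all n fail at once. The 0.99 figure and the three-component layout are hypothetical, and the calculation assumes failures are independent, which is exactly the assumption a fire or a flood breaks. That is one reason a system with no SPOFs can still fail catastrophically.

```python
import math


def series_availability(parts):
    """All parts must work, so their availabilities multiply."""
    return math.prod(parts)


def redundant_availability(a, n):
    """At least one of n independent units, each with availability a, must work."""
    return 1 - (1 - a) ** n


# Hypothetical: disk, controller and power supply, each 99% available.
single = series_availability([0.99, 0.99, 0.99])

# Same three roles, but each one filled by two independent units.
doubled = series_availability([redundant_availability(0.99, 2)] * 3)

print(f"one of each: {single:.6f}")
print(f"two of each: {doubled:.6f}")
```

Under these made-up numbers, the unduplicated chain is noticeably worse than any one of its parts, while the doubled-up chain is better than any single part.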

Naturally, all this costs a stack of money to achieve, but when downtime can result in up to $50 million in lost transactions, this kind of redundancy is cheap in the long run.
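The "cheap in the long run" argument can be sketched as a comparison between the expected loss avoided and the yearly cost of the redundancy. Every number below is hypothetical (the hourly loss, the availabilities and the cost are not taken from any real system), and the function names are invented for this example.

```python
HOURS_PER_YEAR = 8760


def expected_annual_loss(availability, loss_per_hour):
    """Expected yearly loss: fraction of time down, times hours in a year, times hourly loss."""
    return (1 - availability) * HOURS_PER_YEAR * loss_per_hour


def redundancy_pays_off(avail_before, avail_after, loss_per_hour, annual_cost):
    """True if the expected loss avoided per year exceeds the yearly cost of the redundancy."""
    saved = (expected_annual_loss(avail_before, loss_per_hour)
             - expected_annual_loss(avail_after, loss_per_hour))
    return saved > annual_cost


# Hypothetical figures: 1,000,000 lost per hour of downtime,
# availability raised from 99% to 99.99%, redundancy costing 2,000,000 a year.
print(redundancy_pays_off(0.99, 0.9999, 1_000_000, 2_000_000))
```

The same function gives the opposite answer when downtime is cheap, which is why not every system is built this way.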
