‘High Availability’ 101 – Redundancy

High Availability (HA) is one of those topics: ask five people to define it, and you’ll probably get five different answers. I’m not immune to this myself, but I’d like to explore two attributes I believe all highly available systems share. I’ll split them across two blog posts: this time we’ll talk about redundancy, and in the next installment, replication.

Availability is the probability that a system can fulfill its function on behalf of the user (be it a human or another component) at any given point in time. For example, if your system is 90% available, the user should expect to receive service nine times out of ten. What counts as ‘high’ in ‘high availability’ depends on the context, but we are usually talking about the high 90s, and for certain applications people even count the nines after the decimal point (99.9%, 99.99%, and so on).
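To make the nines concrete, here is a minimal Python sketch that converts an availability figure into expected downtime per year. It assumes a 365-day year and reads availability as the long-run fraction of time the system is up, which is one common interpretation of the probability above; the function names are hypothetical.

```python
# Assumption for this sketch: a non-leap year of 365 days.
HOURS_PER_YEAR = 365 * 24


def nines(n: int) -> float:
    """Availability with n nines, e.g. nines(3) is approximately 0.999."""
    return 1.0 - 10.0 ** (-n)


def allowed_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at an availability between 0 and 1."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    # The unavailable fraction of the year, converted to hours.
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in range(1, 6):
        a = nines(n)
        print(f"{a:.3%} available: {allowed_downtime_hours(a):.2f} hours down per year")
```

At three nines (99.9%), for example, that works out to roughly 8.76 hours of downtime a year.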

Availability is tied to two other measurements: reliability and maintainability. Reliability measures how often failures occur, and maintainability measures how easily (or quickly) a failed component can be repaired. It’s possible to have an unreliable but highly available component if maintainability is extremely high. For example, a component could fail all the time, but if you always bring it back up very quickly, it may look like it never failed at all. Likewise, a component can still be highly available with poor maintainability if its reliability is high enough to make up the difference. If the system never fails, you never have to repair it.
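One common way to put numbers on this relationship is the standard steady-state formula A = MTBF / (MTBF + MTTR), where MTBF (mean time between failures) stands in for reliability and MTTR (mean time to repair) stands in for maintainability. The sketch below is illustrative only: the Component class and the figures in it are made up, and it assumes both times are averages in the same unit (hours here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical component described by two averages in the same time unit."""

    mtbf: float  # mean time between failures: the reliability side
    mttr: float  # mean time to repair: the maintainability side

    def availability(self) -> float:
        """Steady-state availability: the average fraction of time spent up."""
        if self.mtbf <= 0 or self.mttr < 0:
            raise ValueError("mtbf must be positive and mttr non-negative")
        return self.mtbf / (self.mtbf + self.mttr)


if __name__ == "__main__":
    # Fails every hour (unreliable) but is back up in 0.001 h, about 3.6 seconds.
    flaky_but_fast = Component(mtbf=1.0, mttr=0.001)
    # Fails about once a year (8760 h) but takes two days (48 h) to fix.
    sturdy_but_slow = Component(mtbf=8760.0, mttr=48.0)
    for name, c in [("flaky_but_fast", flaky_but_fast), ("sturdy_but_slow", sturdy_but_slow)]:
        print(f"{name}: {c.availability():.4%}")
```

With these made-up numbers, the component that fails every hour but recovers in a few seconds comes out around 99.9% available, ahead of the one that fails once a year but takes two days to fix.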

Neither of these extremes is usually practical. It’s more common to expect a certain level of failure and simulate high maintainability through redundancy. If a component is redundant, the service it provides can appear to remain available in the face of failure. The redundant component takes the place of its failed brethren (the act of fail-over), maintaining service to the user while time is spent fixing the failed component. With each redundant copy, you decrease the probability of an overall interruption of service.
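A back-of-the-envelope way to see why each copy helps: if the service is up whenever at least one copy is up, it is down only when all copies are down at once. The Python sketch below assumes independent failures and perfect, instantaneous fail-over, neither of which real systems guarantee; the function name is hypothetical.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Availability of a service that stays up while at least one copy is up.

    Assumes failures are independent and fail-over is instant and always works.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    # The service is down only when every copy is down at the same time.
    all_down = (1.0 - component_availability) ** copies
    return 1.0 - all_down


if __name__ == "__main__":
    for n in range(1, 5):
        label = "copy" if n == 1 else "copies"
        print(f"{n} {label}: {combined_availability(0.9, n):.4%}")
```

Starting from the 90% example earlier, two copies give about 99% and three about 99.9%. Correlated failures (a shared power supply, the same software bug on every copy) eat into these gains, so the result is best read as an optimistic upper bound.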

Of course, there are many issues, trade-offs and limitations to this approach, and redundancy also has implications for increasing throughput via parallelization. My point here is not the details but the concepts. If we are talking about high availability in enterprise systems, we are talking about redundancy. Look for where it lives when you analyze an HA system. It may be hidden under several layers, but it’s there.

Stay tuned: my next post, on Monday, will explore making data access highly available through replication.
