We live in a world where we expect everything to be available 24/7. We want to be able to shop online, register for classes, and watch movies whenever and wherever we want. This means many businesses now need to run 24/7, or at least design the services they provide so that they can be available 24/7. But we also live in a world where life doesn’t always go as planned. A case in point is the past year, which seemed to bring every possible disaster except a zombie apocalypse (although the CDC does publish zombie preparedness guidelines, just in case).
Many organizations learned over the last year how important it is to have a business continuity plan (BCP) in place. A BCP can address many areas, such as who to contact in case of a disaster, safe locations to meet up, etc. BUT a key part of any BCP is ensuring that your business systems such as email or e-commerce sites remain available no matter what. This means you need to design your systems or purchase services that are designed with “high availability and reliability” in mind. But what do “high availability” and “reliability” really mean?
High availability can be defined as the percentage of time that a system remains operational, or is “generally available,” to serve its intended purpose (essentially the absence of downtime). Reliability, on the other hand, is the probability that a given system will yield correct output, sometimes within defined performance standards; it is measured as the percentage of successful requests.
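The distinction is easy to see in code. Here is a minimal sketch of the two definitions above; the specific uptime and request counts are hypothetical figures chosen for illustration:

```python
def availability(uptime_minutes: float, total_minutes: float) -> float:
    """Percentage of time the system was operational ("generally available")."""
    return 100.0 * uptime_minutes / total_minutes

def reliability(successful_requests: int, total_requests: int) -> float:
    """Percentage of requests that yielded a correct result."""
    return 100.0 * successful_requests / total_requests

# A system can be "up" 99.99% of the time yet still fail a
# noticeable share of requests (hypothetical numbers):
minutes_per_year = 365 * 24 * 60                              # 525,600
print(f"{availability(525_547.44, minutes_per_year):.2f}%")   # 99.99%
print(f"{reliability(9_950_000, 10_000_000):.2f}%")           # 99.50%
```

Note that the two numbers are independent: the first measures time, the second measures outcomes.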
How Are Reliability and Availability Calculated?
Cloud-based companies often cite a “reliability” figure of four nines or five nines, meaning 99.99% or 99.999% uptime. That is actually an availability figure, not a reliability figure, since it measures the percentage of time the system is operational or “generally available,” NOT the percentage of requests that users successfully sent to the service. It amounts to a claim of being unavailable for only 52.56 minutes, or 5.26 minutes, out of the entire year. Sounds wonderful!
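Those downtime allowances fall straight out of the percentages. A short sketch of the arithmetic, using a 365-day year:

```python
def max_downtime_minutes(nines: int, minutes_per_year: float = 365 * 24 * 60) -> float:
    """Maximum yearly downtime consistent with an N-nines uptime claim.

    Four nines = 99.99% uptime, five nines = 99.999%, and so on.
    """
    downtime_fraction = 10 ** (-nines)          # e.g. 0.0001 for four nines
    return minutes_per_year * downtime_fraction

print(max_downtime_minutes(4))  # ~52.56 minutes at 99.99%
print(max_downtime_minutes(5))  # ~5.26 minutes at 99.999%
```

Each extra nine shrinks the allowed downtime by a factor of ten, which is why the jump from four to five nines is so expensive to engineer.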
Except that if you go to their service status page, you will often see several outages or disruptions throughout the year, and if you add up the time it took to resolve them, the total is well over 52 minutes.
So how can they claim 99.99% uptime or better? They are leaning on that “generally available” part of the definition of availability. The system as a whole might never have been down for more than 52.56 minutes (or 5.26 minutes) during the year, but certain services, or services in certain regions of the world, could have been down far longer. So a claim of 99.99% or 99.999% availability does not mean that all users were able to successfully access the services 99.99% or 99.999% of the time.
Need for Change
We need to change how we calculate those four or five nines. What matters is the user’s experience, so what should matter is not whether the system was up but whether a user’s request to the system succeeded. In the Identity and Access Management (IAM) space, we should measure this in terms of end-user login success, no matter how users log in or what protocol they use. Numbers like 99.999% uptime have no meaning if they do not match the user’s experience, and we all need to design with our users in mind.
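A user-centric metric like this is straightforward to compute from login events. The sketch below is illustrative only; the event structure, field names, and protocol labels are hypothetical, not taken from any particular IAM product:

```python
from dataclasses import dataclass

@dataclass
class LoginAttempt:
    user_id: str
    protocol: str   # hypothetical labels, e.g. "SAML", "OIDC", "LDAP"
    succeeded: bool

def login_success_rate(attempts: list[LoginAttempt]) -> float:
    """User-centric availability: share of login attempts that succeeded,
    regardless of protocol or login method."""
    if not attempts:
        return 100.0
    ok = sum(1 for a in attempts if a.succeeded)
    return 100.0 * ok / len(attempts)

attempts = [
    LoginAttempt("alice", "OIDC", True),
    LoginAttempt("bob",   "SAML", True),
    LoginAttempt("carol", "OIDC", False),  # e.g. hit by a regional outage
    LoginAttempt("dave",  "LDAP", True),
]
print(f"{login_success_rate(attempts):.1f}%")  # 75.0%
```

Unlike a system-wide uptime figure, this number drops the moment any group of users cannot log in, which is exactly the behavior a user-focused SLA should have.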