Reliability
When a business SaaS service (product) is unavailable, the business is affected. For some services it is not so serious (you may not be able to enter and process your recent receipts and have to wait a bit); for others it is critical (you can’t send or receive emails).
In our industry, broadly “Identity and Access Management,” we sell a product to organizations that configure and customize it, then roll it out to their own end-users. These end-users then rely on our product to access ALL of their business applications. We are the gateway to your office suite, your HR systems, your CRM solutions, and everything else. And consider: we aren’t just providing access to Zoom and Slack – we are providing access to retail point-of-sale systems, manufacturing shop floors, health care systems used to provide patient care and dispense medication, and more.
If end-users lose the ability to access these systems, all of our customers’ employees notice it, typically including the CEO (who can’t access their applications either). Our customers’ internal help desks light up. They spin up their own incident response teams (sometimes waking people up in the middle of the night). It’s a very big deal. It seriously impacts our customers’ business, sometimes bringing it to a halt. It impacts our reputation and standing with the customer.
That is why, at OneLogin, reliability is the second most important “feature” of the product (second only because we are security-first). We are all dedicated to achieving super-high reliability, and the whole organization is aligned around this goal.
But wait, what exactly is “This Goal”?
Let me first start with the story.
The Story You Already Know
Company is started. Product launched.
It grows. It grows faster.
The product and code are growing; controls are lacking.
Everything is natural, disorganized, “startup-y.”
The design wheels start to squeak. It gets harder and harder to release without a bug.
The system is stressed with an increasing number of users.
Things start to randomly fail.
Managers start to talk about “processes,” “controls,” “non-functional requirements.”
Engineers start to talk about “technical debt,” “legacy,” “redesign” or “rewrite.”
I have seen variations of the same story in each and every successful company.
So what now?
Rewrite everything, right? 🙂
Let’s Improve
Ok, no rewrite then. We need to improve, iteratively. But:
“If you can’t measure it, you can’t improve it.”
PETER DRUCKER
Peter was smart. Much smarter than me.
Well then: we can’t improve until we know what to measure and how to measure it…
The Metric
Let’s measure the reliability!
It may be tempting to start measuring and setting goals around everything, but that is a bad idea. It distracts focus from the most important metrics and goals, dilutes resources and often results in paralysis by analysis.
Our goal was to come up with a single, objective metric that best represented our customers’ view of the quality of service we were delivering. Here’s the result:
One Metric: End-User Login Success
Definition: An end-user is able to sign into OneLogin or to a third-party app via OneLogin.
Details: This is our key reliability metric, which covers end-user login success, including any request to OneLogin on behalf of an end-user attempting to access OneLogin, authenticate to OneLogin, or authenticate to or access an app via OneLogin, whether via the OneLogin UX, a supported protocol, or an API.
Note: Our product has many other functions besides supporting end-user access to applications. We have a large administrative console that is used to configure, manage, and monitor the product. We have many housekeeping tasks that we run on a continuous basis, including directory synchronization and user provisioning to applications. While these other functions are important, they are not as sensitive and critical as the real-time end-user access.
How
We decided to measure HTTP response codes and group them in two categories:
- Failures – all 50x response codes, such as 500 (Internal Server Error), 502 (Bad Gateway), 503 (Service Unavailable), 504 (Gateway Timeout)
- Successes – everything else
The metric is then the number of successful end-user requests (non-50x) divided by the number of all end-user requests (in the measured window).
This is relatively simple, yet a close enough approximation of the end-user experience.
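As a minimal sketch of that calculation (the helper names below are mine, not our production pipeline), assuming each end-user request record carries the HTTP status code returned by the edge proxy:

```python
# A rough sketch of the reliability metric (hypothetical helpers, not our production code).

def is_failure(status_code: int) -> bool:
    """A request counts as a failure if it returned a 50x response code."""
    return 500 <= status_code <= 509

def reliability(status_codes: list[int]) -> float:
    """Successful (non-50x) end-user requests divided by all end-user requests in the window."""
    if not status_codes:
        return 1.0  # an empty window has nothing to count against us
    failures = sum(1 for code in status_codes if is_failure(code))
    return (len(status_codes) - failures) / len(status_codes)

# Example: 1,000,000 end-user requests with 120 failures -> 99.988% reliability
print(f"{reliability([200] * 999_880 + [503] * 120):.3%}")
```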
Step two was to decide where to get the data for the metric.
We considered three approaches:
- Collect data from the e2e-monitor (every 90 seconds, our e2e-monitor runs a suite of end-to-end tests that exercise all major use cases against all OneLogin regions – more on the e2e-monitor in a future blog post)
- Collect data from frontend services (“frontend” services are the services that ingest requests from outside our infrastructure)
- Collect data from our edge proxy
The advantage of the e2e-monitor approach is that the test suite is driven from the end-user perspective, so it truly reflects the full end-user experience (including, for example, bugs in our browser code or browser retries). On the other hand, even with extensive test coverage of all the use cases, the tests would still represent only a tiny fraction of requests, and we would never reach the level of precision we needed.
Frontend services have the most information about each request and could add more details about users and accounts, or, for example, correlate whole login flows. But the implementation (and any later update) would be spread across many services and would require additional effort. It would also not catch connectivity or routing problems between the proxy and the frontend services.
Therefore, we decided to go with the edge proxy approach – all the data would come from one service (the proxy), and the information we had about each request seemed sufficient for our goal.
Step three was to decide how to narrow down and correlate “end-user” traffic. We started by defining the endpoints (and paths) that end-users use, and then fine-tuned that set via a combination of HTTP method, endpoint, and path, until we had a very close representation of all end-user interactions.
For illustration, here’s part of the current (as of 6/2021) metric calculation, after many iterations:
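Purely as a hypothetical illustration of how such a classification can be expressed (the HTTP methods and paths below are made up, not our real endpoint set), the rules boil down to method-and-path matchers applied to the edge proxy traffic:

```python
import re

# Hypothetical end-user traffic matchers: (HTTP method, path pattern).
# A real set would be built from edge proxy logs and refined over many iterations.
END_USER_RULES = [
    ("POST", re.compile(r"^/sessions$")),      # end-user login (illustrative path)
    ("GET",  re.compile(r"^/launch/\d+$")),    # launching a third-party app (illustrative path)
    ("POST", re.compile(r"^/saml/acs/.+$")),   # SAML authentication flow (illustrative path)
]

def is_end_user_request(method: str, path: str) -> bool:
    """Keep only the requests that represent real end-user interactions."""
    return any(method == m and pattern.match(path) for m, pattern in END_USER_RULES)

# Administrative or housekeeping traffic is excluded from the metric.
print(is_end_user_request("POST", "/sessions"))       # True
print(is_end_user_request("GET", "/admin/settings"))  # False
```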
The Goal
The last step was to decide on the target goal.
We wanted to eventually achieve 99.999% reliability (five nines), but that looked utopian back in January 2020, when we started this effort.
So we decided to first hit 99.99% (four nines) by the end of 2020 – which was still a very ambitious goal.
We also set up a monthly CEO review meeting (we are very lucky to have a CEO who is a big believer in technical excellence as a driver of customer satisfaction and the overall business), where we reported our track record, our challenges, and all the improvements we were making in order to reach the goal.
The Journey
So how did we do?
First attempt
This is a screenshot from our January 2020 review.
We had just defined the early metric and the initial set of endpoints. We did not have any dashboard, so we dug into the edge proxy logs, ran some queries, used Google Sheets, and manually calculated the data for December and January (month-to-date) to understand what we needed to automate and what our current number was.
First dashboard
In February 2020, we got our first dashboard. We continued fine-tuning the “end-user flows” coverage and also started measuring the metric not only per shard (we have US and EU shards), but also specifically for the first two key services, in order to start reporting specifics to the service owners.
Fun fact: it took this early dashboard 2-3 hours to calculate 30 days of data on each refresh :-). But we had SOMETHING!
Introducing Error Budgets
In March 2020, we introduced the concept of Error Budgets from the Google Site Reliability Engineering (SRE) book in order to clearly communicate the needed improvements and to create a common incentive for the Platform (SRE) team and product development to find the right balance between innovation and reliability.
As you can see from the screenshots, we also started calculating the impact of each incident using our new metric.
As the eventual goal of 99.99% was still far away, we started setting smaller goals and defined red / yellow / green thresholds to illustrate our progress toward those iterative goals. Once we reached green, we defined the next goal.
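To make the error budget idea concrete, here is a back-of-the-envelope sketch (the monthly request volume is made up): the budget is simply the fraction of end-user requests that may fail before the reliability target is missed.

```python
# Back-of-the-envelope error budget (illustrative traffic volume, not our real numbers).

def error_budget(target: float, total_requests: int) -> int:
    """How many failed requests the target tolerates within the measured window."""
    return round(total_requests * (1.0 - target))

monthly_requests = 500_000_000  # hypothetical monthly end-user request volume

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} target -> {error_budget(target, monthly_requests):,} failed requests allowed")
```

Once the budget for a window is spent, priorities tip from shipping new features toward reliability work, which is exactly the innovation-versus-reliability trade-off the SRE book formalizes.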
Climax
“The climax is the turning point, which changes the protagonist’s fate. If things were going well for the protagonist, the plot will turn against them, often revealing the protagonist’s hidden weaknesses.”
FROM FREYTAG’S PYRAMID OF DRAMA
At the end of July, we reached a tipping point. No matter how hard we tried, the numbers did not improve:
Subjectively, we felt we were doing better, but the numbers said otherwise. There were two primary reasons for it:
- We continued to fine-tune our metric and kept adding non-standard services that were more difficult to represent or to get proper data for (like UDP-based RADIUS), and because these new endpoints and services were doing worse, including them pulled the number down
- We had pretty bad reliability incidents in May and July
The mood was not good and the goal seemed impossible. In addition, the COVID-19 situation did not make things better.
Project Red-Zone
As a response to the unsatisfactory results, at the beginning of August we launched project Red-Zone: an engineering-wide, focused effort on all the necessary reliability work. We postponed (or deprioritized) all other activities except security bugs and improvements, and fully focused on reliability.
Everyone was called to action; we defined the scope of the reliability work that needed to be done before we could get back to normal operations. The Platform/SRE team took the temporary lead on prioritizing all the details and any incoming engineering tasks.
Our areas of focus were mostly:
- Login Clusters – finish the major redesign of our Hydra architecture (more on Login Clusters in a future blog post)
- End-User Success – a concerted effort to address the failure response codes; we set targets to reduce failures by 30% each month (see the quick sketch after this list)
- CAPA – our “Corrective and Preventive Actions” process tracks tickets coming from the analysis and postmortem of each reliability incident; there was a backlog of items to be fixed
- On-Call support – all service owner teams would establish on-call support for their services
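To put the 30% monthly target in perspective, here is a quick sketch (the starting failure count is made up) of how that reduction compounds month over month:

```python
# Compounding effect of a 30% monthly reduction in failures (illustrative starting point).
failures = 10_000.0  # failed end-user requests per month at the start (made-up number)
for month in range(1, 7):
    failures *= 0.7  # each month keeps 70% of the previous month's failures
    print(f"After month {month}: ~{failures:,.0f} failures/month")
```

After half a year of hitting the target, fewer than 12% of the original failures remain.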
The project was a major success.
Not only did we put our original goal back on track and achieve it at the end of the year, but we were also able to finish some planned major redesigns of our platform that became a cornerstone for future innovations!
This was our track record after the Red-Zone project started:
New Goal
Our CEO says “the reward for good work is more hard work to do,” so at the end of 2020, we came up with a new goal for 2021: reach four and half nines (99.995%) by end of Q1 and five nines (99.999%) by end of 2021.
Adding “just one more nine” may not seem like a big deal, but it is easy to miss that it means being 10x better: at four nines, up to 100 requests out of every million may fail; at five nines, only 10. An order-of-magnitude improvement often means that what worked before will no longer work and needs a complete rethink, with changes in many areas.
So far, we have hit our Q1 goal, and I feel optimistic about the end goal. There are a lot of challenges, but hell, I like solving these interesting and non-trivial problems!
Here’s a recent evolution of our reliability dashboard, so stay tuned for future updates!
Note: Reliability vs Availability
Our way of measuring Reliability is much more objective than traditional Availability, which simply divides the time when the system was available (= not completely down) by the total time. That does not account for high-traffic hours (Monday morning and Saturday night count the same in traditional Availability), nor for randomly failing requests during normal operations. And when you get to high reliability numbers of four, five, or six nines, those details matter tremendously.
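As a hedged illustration of why the two measures diverge (all traffic numbers below are made up), consider a single 30-minute outage in a 30-day month: time-based Availability does not care when it happens, while request-based Reliability very much does.

```python
# Time-based Availability vs request-based Reliability for one 30-minute outage
# in a 30-day month (all traffic numbers are made up for illustration).

minutes_in_month = 30 * 24 * 60
availability = (minutes_in_month - 30) / minutes_in_month  # same value whenever the outage happens

total_requests = 500_000_000
peak_outage_failures = 2_000_000   # outage during a Monday-morning peak
quiet_outage_failures = 50_000     # outage during a Saturday-night lull

print(f"Availability (either case):      {availability:.4%}")
print(f"Reliability (peak-hour outage):  {(total_requests - peak_outage_failures) / total_requests:.4%}")
print(f"Reliability (quiet-hour outage): {(total_requests - quiet_outage_failures) / total_requests:.4%}")
```

Likewise, randomly failing requests during “normal operation” lower Reliability while leaving time-based Availability untouched, which is why the request-based number is closer to what customers actually feel.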
Based on our experience applying both calculations to past incidents, our Reliability SLA is “more strict” than a “good” Availability SLA by one to one and a half nines (so four nines of our Reliability SLA is practically equivalent to a four-and-a-half or five nines traditional Availability SLA).
Summary
So, what did we learn?
Defining the metric was a turning point. It did not matter how good or bad the metric was; it was a start, and it suddenly and naturally shifted discussions and focus toward the single number and toward making that number better.
Once the metric number and tracking materialized, it magically aligned the whole organization around the goal. Each small win was visible and celebrated. Everyone understands sales revenue because it is measured in dollars. This is a way to translate the same powerful motivation we know from sales to reliability engineering (or any other area).
Having the initial “dumb” metric in place fostered many previously unimagined uses and improvements that resulted in key tools and product innovations – like objectively measuring the impact of incidents, real-time end-user impact dashboards, future per-customer health dashboards, and more.
Do not let yourself be paralyzed by complexity or by too high a goal. Start trivial, stupid, dumb, small, and iterate, but START. This is a well-known truth, but in my opinion it is still widely underestimated.
See how we improved our reliability in 2021!