This is a lesson from my industrial software days. Bear in mind that my first-hand experience here is 20 years old, but I think the concepts still apply. Specifically, in a Service-Oriented Architecture (SOA), these concepts can help determine whether or not a service is operating at an acceptable level. In other words, they help decisions become data-driven decisions.
In manufacturing, there are many ways to measure operational success. One in particular is Overall Equipment Effectiveness (OEE). OEE is a simple percentage of how well the system is working versus how well it is expected to work. That “versus” is important, as I’ve previously mentioned. One common definition of OEE is:
Equipment Availability * Performance Efficiency * Quality Rate
So let’s look at those three individually (and how to apply them to software).
Equipment Availability (EA)
This is how often the equipment (or service) is available versus how often it is expected to be available. When people talk about “five nines” uptime, this is EA. If a service is unavailable for five minutes every day, then its EA is 99.653% (1435 minutes / 1440 minutes). In the context of SOA, this metric is easy to identify, but can sometimes be difficult to measure. For something like a REST API, having a simple heartbeat route can facilitate this.
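To make that concrete, here is a minimal sketch in Python (Flask is used purely as an example framework; the route name and helper function are hypothetical): a heartbeat route an external monitor can poll, plus the EA arithmetic from the example above.

```python
from flask import Flask

app = Flask(__name__)

# A trivial heartbeat route an external monitor can poll to record uptime.
@app.route("/heartbeat")
def heartbeat():
    return {"status": "ok"}

def equipment_availability(unavailable_minutes, expected_minutes=1440):
    """EA = time actually available / time expected to be available."""
    return (expected_minutes - unavailable_minutes) / expected_minutes

# Five minutes of downtime in a day: 1435 / 1440, or about 99.653%
print(f"{equipment_availability(5):.3%}")
```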
Performance Efficiency (PE)
This is how quickly the service completes transactions versus how quickly it is expected to complete them. In the context of SOA, this metric is often easy to measure, but it is typically poorly represented (because there’s often no “versus”). Since we already know it should be a percentage (versus!), let’s assume that 100% represents an instantaneous transaction, and then identify a timeout for the transaction (e.g. 30 seconds). Once a transaction hits the timeout (at 0% PE), then it is no longer a performance issue but a quality issue (see below). For example, if a call has a 30s timeout, and it took 5 seconds, then its PE is 83.333% (25 seconds / 30 seconds).
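As a sketch of that arithmetic (Python again; the 30-second timeout is just the example figure from above):

```python
def performance_efficiency(elapsed_seconds, timeout_seconds=30.0):
    """PE = time remaining before the timeout, as a fraction of the timeout.

    100% would be an instantaneous transaction; a transaction that reaches
    the timeout scores 0% here and is counted as a quality failure instead.
    """
    remaining = max(timeout_seconds - elapsed_seconds, 0.0)
    return remaining / timeout_seconds

# A 5-second call against a 30-second timeout: 25 / 30, or about 83.333%
print(f"{performance_efficiency(5):.3%}")
```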
Quality Rate (QR)
This is how often the service successfully completes transactions versus the sum of both successful and failed transactions. In the context of SOA, this metric is often the easiest to measure (and is therefore a good place to start). Just count your errors (because you’re probably already logging them) and count either your successful attempts or all of your attempts (the math works for any two of those three). For example, if 99 of 100 transactions complete successfully and 1 fails, then the QR is 99.000% (99 / 100). And note that a transaction that exceeded its timeout falls into this category, not into PE (see above).
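The math here is even simpler; a sketch:

```python
def quality_rate(successes, failures):
    """QR = successful transactions / all transactions (successes + failures).

    Transactions that exceeded their timeout count as failures here,
    not as a PE penalty.
    """
    total = successes + failures
    return successes / total if total else 1.0

# 99 successes and 1 failure: 99 / 100, or 99.000%
print(f"{quality_rate(99, 1):.3%}")
```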
Overall Equipment Effectiveness (OEE)
Given the three examples above, the OEE in this case is 82.213%:
99.653% EA * 83.333% PE * 99.000% QR = 82.213% OEE
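Or, as a small sketch combining the three figures above:

```python
def overall_equipment_effectiveness(ea, pe, qr):
    """OEE is simply the product of the three constituent percentages."""
    return ea * pe * qr

# Using the rounded figures from the examples above.
ea = 0.99653   # Equipment Availability
pe = 0.83333   # Performance Efficiency
qr = 0.99000   # Quality Rate
print(f"{overall_equipment_effectiveness(ea, pe, qr):.3%}")  # 82.213%
```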
This boils a service down to a single metric, which can go on a dashboard and present a red/green kind of feedback. It’s not perfect, but what single metric would be? Still, this metric (as well as its three constituents) is easy to comprehend quickly because, in part, it is always a “higher is better” percentage. No matter how you slice it, 0 is bad, 100 is good, and the higher you are, the better you’re doing.
The only caveat is that PE tends to track lower than the others because of how it is measured (and that can skew the OEE number). But even for PE, all of those characteristics are true. I’ll also add that some may take exception to the fact that it’s virtually impossible to get PE to 100% (while, over a short enough period, both EA and QR can reach 100%). But I share the opinion of my high school physics teacher: “If 100% is attainable, then the test is too easy.”
In the end, though, that one magic number is of only so much value. Therefore…
Focus on EA, PE, and QR
Each of these three Service Level Indicators (or SLIs) is valuable on its own. Much more so than the single OEE number, these are the true value of OEE. In the context of SOA, an organization can establish the baseline EA, PE, and QR for each service, then set targets for them (Service Level Objectives, or SLOs). Now the owners of those services have specific targets to prioritize their work against. If, as a service owner, you’re “in the green” on both EA and PE, but not hitting your QR number, then you know where to focus your efforts.
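As a sketch of that kind of check (the targets and measurements here are made up):

```python
# Hypothetical SLO targets and measured SLIs for one service.
slos = {"EA": 0.9990, "PE": 0.8000, "QR": 0.9950}
slis = {"EA": 0.9993, "PE": 0.8330, "QR": 0.9900}

# Flag whatever is below target so the service owner knows where to focus.
for name, target in slos.items():
    status = "green" if slis[name] >= target else "below target"
    print(f"{name}: {slis[name]:.2%} vs {target:.2%} -> {status}")
```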
Now imagine being a service owner whose service has a QR target that is higher than the QR of a service on which you depend. This leads into an important aspect of OEE: it is meant to be a metric that works across a hierarchy of components. In this case, you have a very real and direct metric that serves as the focal point for a necessary conversation: you’re being given an SLO that you cannot reliably meet, due to the dependency.
Where Do We Start?
Your public-facing interfaces are your starting point. Back to the factory analogy, these are what leave the factory and go to your customers. If a factory has five manufacturing lines, four of them are up, and the fifth one is down, then the factory’s EA is 80% (four at 100% and one at 0%). That’s an oversimplification, but the principles still apply. If you have five public-facing services, four are up, and one is down, your EA is also 80%. So what are your public-facing services? Those are the components of your organization’s OEE. And then you work backwards from there, examining internal services that feed those, each with their own OEE actuals and targets.
How it bubbles up is entirely dependent on how your organization is structured. But the targets are yours to set. If you have two public-facing services, and one of them is 80% of your business, then calculate your overall EA, PE, and QR targets based on that ratio. For that factory with one line down, what if that one line is responsible for half the factory’s output? Then its EA is 50%, not 80%. In the end, every metric is a percentage of something. How you arrive at the targets for those is simply an extension of the priorities you place on them.
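Here is a sketch of that kind of weighted rollup, using EA and the factory example (the weights are illustrative):

```python
def weighted_ea(services):
    """Roll per-service EA up to one figure; services is a list of
    (ea, share_of_business) pairs whose shares sum to 1."""
    return sum(ea * share for ea, share in services)

# Four lines up and one down, all equally weighted: 80%
equal_shares = [(1.0, 0.2)] * 4 + [(0.0, 0.2)]
print(f"{weighted_ea(equal_shares):.0%}")

# Same outage, but the down line carries half the output: 50%
lopsided = [(1.0, 0.125)] * 4 + [(0.0, 0.5)]
print(f"{weighted_ea(lopsided):.0%}")
```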
We typically track many metrics in software, often too many. We’re flooded with data and have difficulty prioritizing alerts and our responses to them. I think this approach helps combat that problem because it is simple, covers a lot of ground, and is easy to scale, easy to understand, and easy to communicate.