An Approach to Alerting and Monitoring

This is a lesson from my industrial software days. Bear in mind that my first-hand experience here is 20 years old, but I think the concepts still apply. Specifically, in a Service-Oriented Architecture (SOA), these concepts can help determine whether a service is operating at an acceptable level. In other words, they help make those decisions data-driven.

In manufacturing, there are many ways to measure operational success. One in particular is Overall Equipment Effectiveness (OEE). OEE is a simple percentage of how well the system is working versus how well it is expected to work. That “versus” is important, as I’ve previously mentioned. One common definition of OEE is:

Equipment Availability * Performance Efficiency * Quality Rate

So let’s look at those three individually (and how to apply them to software).

Equipment Availability (EA)

This is how often the equipment (or service) is available versus how often it is expected to be available. When people talk about “five nines” uptime, this is EA. If a service is unavailable for five minutes every day, then its EA is 99.653% (1435 minutes / 1440 minutes). In the context of SOA, this metric is easy to identify, but can sometimes be difficult to measure. For something like a REST API, having a simple heartbeat route can facilitate this.
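
To make the arithmetic concrete, here is a minimal sketch in Python. The function name and the five-minutes-of-downtime figure are just the example from above, not anything standardized:

```python
# Minimal sketch: Equipment Availability (EA) for one day, given the number
# of minutes the service was unavailable. Assumes the service is expected to
# be available around the clock.
MINUTES_PER_DAY = 1440

def equipment_availability(unavailable_minutes, expected_minutes=MINUTES_PER_DAY):
    """EA as a percentage: actual available time versus expected available time."""
    available = max(expected_minutes - unavailable_minutes, 0)
    return 100.0 * available / expected_minutes

print(round(equipment_availability(5), 3))  # 99.653 -- five minutes down per day
```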

Performance Efficiency (PE)

This is how quickly the service completes transactions versus how quickly it is expected to complete them. In the context of SOA, this metric is often easy to measure, but it is typically poorly represented (because there’s often no “versus”). Since we already know it should be a percentage (versus!), let’s assume that 100% represents an instantaneous transaction, and then identify a timeout for the transaction (e.g. 30 seconds). Once a transaction hits the timeout (at 0% PE), then it is no longer a performance issue but a quality issue (see below). For example, if a call has a 30s timeout, and it took 5 seconds, then its PE is 83.333% (25 seconds / 30 seconds).
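
A similar sketch for PE, using the hypothetical 30-second timeout and 5-second call from the example:

```python
# Minimal sketch: Performance Efficiency (PE) for a single transaction.
# 100% is an instantaneous call, 0% is a call that hit its timeout.

def performance_efficiency(duration_s, timeout_s):
    """PE as a percentage: remaining headroom versus the timeout budget."""
    remaining = max(timeout_s - duration_s, 0.0)  # a timed-out call scores 0% and counts against QR instead
    return 100.0 * remaining / timeout_s

print(round(performance_efficiency(5.0, 30.0), 3))  # 83.333
```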

Quality Rate (QR)

This is how often the service successfully completes transactions versus the sum of both successful and failed transactions. In the context of SOA, this metric is often the easiest to measure (and is therefore a good place to start). Just count your errors (because you’re probably already logging them) and count either your successful attempts or all of your attempts (the math works for any two of those three). For example, if 99 of 100 transactions complete successfully and 1 fails, then the QR is 99.000% (99 / 100). And note that a transaction that exceeded its timeout falls into this category, not into PE (see above).
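
And a sketch for QR; the counts here are hypothetical, standing in for whatever your logging or metrics pipeline already gives you:

```python
# Minimal sketch: Quality Rate (QR) as successful transactions versus all
# transactions (timeouts count as failures, per the PE section above).

def quality_rate(successes, failures):
    """QR as a percentage of successful transactions."""
    total = successes + failures
    return 100.0 * successes / total if total else 100.0

print(round(quality_rate(99, 1), 3))  # 99.0
```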

Overall Equipment Effectiveness (OEE)

Given the three examples above, the OEE in this case is 82.213%:

99.653% EA * 83.333% PE * 99.000% QR = 82.213% OEE
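
Or, as a sketch that ties those three hypothetical numbers together:

```python
# Minimal sketch: combine EA, PE, and QR (each a percentage) into OEE.

def oee(ea_pct, pe_pct, qr_pct):
    return 100.0 * (ea_pct / 100.0) * (pe_pct / 100.0) * (qr_pct / 100.0)

print(round(oee(99.653, 83.333, 99.000), 3))  # 82.213
```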

This boils a service down to a single metric, which can go on a dashboard and provide red/green-style feedback at a glance. It’s not perfect, but what single metric would be? Still, this metric (as well as its three constituents) is easy to comprehend quickly because, in part, it is always a “higher is better” percentage. No matter how you slice it, 0 is bad, 100 is good, and the higher you are, the better you’re doing.

The only caveat is that PE tends to track lower than the others because of how it is measured (and that can skew the OEE number). But even for PE, all of those characteristics are true. I’ll also add that some may take exception to the fact that it’s virtually impossible to get PE to 100% (while, over a short enough period, both EA and QR can reach 100%). But I share the opinion of my high school physics teacher: “If 100% is attainable, then the test is too easy.”

In the end, though, that one magic number is of only so much value. Therefore…

Focus on EA, PE, and QR

Each of these three Service Level Indicators (or SLIs) is highly useful on its own. Far more than the single OEE number, these three are where the real value of OEE lies. In the context of SOA, an organization can establish the baseline EA, PE, and QR for each service, then establish targets for them (Service Level Objectives, or SLOs). Now the owners of those services have specific targets to guide their prioritization of work. If, as a service owner, you’re “in the green” on both EA and PE, but not hitting your QR number, then you know where you need to focus your efforts.
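
As a sketch of that prioritization, assuming per-service SLI measurements and SLO targets shaped like the simple dictionaries below (all of the numbers are hypothetical):

```python
# Minimal sketch: compare a service's measured SLIs against its SLO targets
# and report which indicators need attention.

def missed_slos(slis, slos):
    """Return the names of indicators whose measured value is below target."""
    return [name for name, target in slos.items() if slis.get(name, 0.0) < target]

actuals = {"EA": 99.95, "PE": 85.0, "QR": 98.2}
targets = {"EA": 99.90, "PE": 80.0, "QR": 99.5}
print(missed_slos(actuals, targets))  # ['QR'] -- green on EA and PE, so focus on quality
```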

Now imagine being a service owner whose service has a QR target that is higher than the QR of a service on which you depend. This leads into an important aspect of OEE: it is meant to be a metric that works across a hierarchy of components. In this case, you have a very real and direct metric that serves as the focal point for a necessary conversation: you’re being given an SLO that you cannot reliably meet because of the dependency.

Where Do We Start?

Your public-facing interfaces are your starting point. Back to the factory analogy, these are what leave the factory and go to your customers. If a factory has five manufacturing lines, four of them are up, and the fifth one is down, then the factory’s EA is 80% (four at 100% and one at 0%). That’s an oversimplification, but the principles still apply. If you have five public-facing services, four are up, and one is down, your EA is also 80%. So what are your public-facing services? Those are the components of your organization’s OEE. And then you work backwards from there, examining internal services that feed those, each with their own OEE actuals and targets.

How it bubbles up is entirely dependent on how your organization is structured. But the targets are yours to set. If you have two public-facing services, and one of them is 80% of your business, then calculate your overall EA, PE, and QR targets based on that ratio. For that factory with one line down, what if that one line is responsible for half the factory’s output? Then its EA is 50%, not 80%. In the end, every metric is a percentage of something. How you arrive at the targets for those percentages is simply an extension of the priorities you place on them.
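
One way to express that kind of weighting, as a sketch (the service names and the 80/20 split are hypothetical, mirroring the example above):

```python
# Minimal sketch: roll per-service EA up into an organization-level EA using
# business weights (weights should sum to 1).

def weighted_availability(services):
    """services maps name -> (EA percentage, business weight)."""
    return sum(ea * weight for ea, weight in services.values())

# Two public-facing services; the one carrying 80% of the business is up.
print(weighted_availability({"storefront": (100.0, 0.8), "reporting": (0.0, 0.2)}))  # 80.0
# Weight them 50/50 instead and the same outage costs half your availability.
print(weighted_availability({"storefront": (100.0, 0.5), "reporting": (0.0, 0.5)}))  # 50.0
```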

We typically track many metrics in software, often too many. We are flooded with data and struggle to prioritize alerts and our responses to them. I think this approach helps combat that problem: it is simple, covers a lot of ground, and is easy to scale, easy to understand, and easy to communicate.

Daily Stand-Ups (aka Daily Scrum)

I’m passionate about understanding what makes software engineering teams great, and applying and sharing that understanding however I can. And of the many different ways I’ve seen software built over the years, I’m convinced that Agile and Scrum – if they are applied well – combine to form a highly successful approach to building software. To that end, I have several posts about different aspects of Agile and Scrum, based on my own experience over the years. Here’s one of them.

Scrum.org says, “The purpose of the Daily Scrum is to inspect progress toward the Sprint Goal and adapt the Sprint Backlog as necessary, adjusting the upcoming planned work.” In my experience, this is the method that has worked most effectively:

  1. Do NOT simply go around the room and give status updates. This meeting has a purpose, and that purpose is not served by “What did you do yesterday?” and “What are you doing today?” To rephrase scrum.org, the purpose is answering the question, “Do we need to change our sprint?” That is the question that matters.

All effective meetings fall into one of four types: Learn, Decide, Do, and Bond. The Daily Stand-Up, like most Sprint meetings, is a Decide meeting. Its purpose is to make a decision, and its content should contribute directly to that decision. The single best change you can make to your Daily Stand-Up is to get away from this round-robin format.

We’ve all seen this format. It’s pervasive to the point that it doesn’t even hit our radar anymore. We simply take it for granted that this is how daily stand-ups are supposed to operate. But many of us recognize, instinctively, that there’s something wrong with it. When your instinct is telling you there’s something not quite right, listen to it. Figure out what it’s trying to tell you and make a change.

  2. Limit the meeting (ideally to 15 minutes). This is a daily meeting of the entire scrum team, and has the potential to consume a considerable amount of time over the course of a sprint. This limit can seem artificial if the round-robin anti-pattern is taking place, as time is often taken by those at the head of the line at the expense of those at the tail of it. But once the format is fixed, the time limit is no longer a problem.

Let’s assume a scrum team of six people is meeting daily for 15 minutes for a two-week sprint cycle. Let’s also assume that, for whatever reason, the stand-up is only occurring for eight of the ten days of the sprint, and that each participant requires zero time for context-switching before and after the stand-up (an admittedly false assumption). That is still 12 work hours spent by the team, cumulatively, on the daily stand-up. Twelve hours every sprint. And that’s the ideal. Reality is often far, far more than that.

In one situation, my team was meeting for 45 minutes to an hour every day. It was a drain in many ways, and I attempted at one point to fix it. What I learned the hard way (after half the team reacted negatively) was that the daily stand-up for my geographically distributed team actually served two purposes – Decide and Bond. This was the team’s opportunity to socialize, and I took that away. A better move would have been to separate the two, keeping the daily stand-up efficient while also providing the team with other ways to bond, but external factors meant that did not come to pass before I moved on.

  3. Address the sprint goals in priority order. For each, ask this question: “Are we still on track to accomplish this goal by the end of the sprint?” If yes, then move on to the next goal (or user story). If no, then decide, then and there, how you will adjust. Do you move it out of the sprint? Do you move a lower priority item out of the sprint? Do you reduce the scope of this item? There are many actions that can be taken, and the only wrong one is to do nothing.

This is what makes the 15-minute time cap work. If you’re starting with the highest priority issue, then your 15 minutes is well-spent. If you get through all of the items, then that’s great! If not, then you still ensure that you’re covering the most important goals while keeping to the schedule.

This depends, of course, on having good prioritization. Many teams fail that step, and then – because they don’t have their work prioritized – they shrug off conducting their daily stand-ups this way. One typical excuse is that all of the individual contributors are “doing their own thing” anyway. If that’s the case, then re-evaluate how you’ve formed your scrum team. If you’re not working together toward common team goals, then you’re not really a team. And as part of that evaluation, I would even consider the possibility that perhaps the scrum format isn’t right for you.

And that’s what it takes for effective Daily Stand-Ups. Remember, above all, that it’s a Decide meeting. Make the appropriate decisions and then get on with your day.