An Approach to Alerting and Monitoring

This is a lesson from my industrial software days. Bear in mind that my first-hand experience here is 20 years old, but I think the concepts still apply. Specifically, in a Service-Oriented Architecture (SOA), these concepts can help determine whether or not a service is operating at an acceptable level. In other words, they help make those decisions data-driven.

In manufacturing, there are many ways to measure operational success. One in particular is Overall Equipment Effectiveness (OEE). OEE is a simple percentage of how well the system is working versus how well it is expected to work. That “versus” is important, as I’ve previously mentioned. One common definition of OEE is:

Equipment Availability * Performance Efficiency * Quality Rate

So let’s look at those three individually (and how to apply them to software).

Equipment Availability (EA)

This is how often the equipment (or service) is available versus how often it is expected to be available. When people talk about “five nines” uptime, this is EA. If a service is unavailable for five minutes every day, then its EA is 99.653% (1435 minutes / 1440 minutes). In the context of SOA, this metric is easy to identify, but can sometimes be difficult to measure. For something like a REST API, having a simple heartbeat route can facilitate this.
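
For a REST API, that heartbeat can be as small as a single route that an uptime monitor polls. Here is a minimal sketch using ASP.NET Core minimal APIs; the route name and payload are illustrative assumptions, not anything prescribed:

    // A minimal ASP.NET Core app exposing a heartbeat route. An external
    // monitor polls it, and each successful 200 response within the polling
    // window counts toward Equipment Availability.
    var builder = WebApplication.CreateBuilder(args);
    var app = builder.Build();

    // Keep the route cheap and side-effect-free; its only job is to say "I'm up."
    app.MapGet("/heartbeat", () => Results.Ok(new { status = "alive", checkedAtUtc = DateTime.UtcNow }));

    app.Run();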

Performance Efficiency (PE)

This is how quickly the service completes transactions versus how quickly it is expected to complete them. In the context of SOA, this metric is often easy to measure, but it is typically poorly represented (because there’s often no “versus”). Since we already know it should be a percentage (versus!), let’s assume that 100% represents an instantaneous transaction, and then identify a timeout for the transaction (e.g. 30 seconds). Once a transaction hits the timeout (at 0% PE), then it is no longer a performance issue but a quality issue (see below). For example, if a call has a 30s timeout, and it took 5 seconds, then its PE is 83.333% (25 seconds / 30 seconds).

Quality Rate (QR)

This is how often the service successfully completes transactions versus the sum of both successful and failed transactions. In the context of SOA, this metric is often the easiest to measure (and is therefore a good place to start). Just count your errors (because you’re probably already logging them) and count either your successful attempts or all of your attempts (the math works for any two of those three). For example, if 99 of 100 transactions complete successfully and 1 fails, then the QR is 99.000% (99 / 100). And note that a transaction that exceeded its timeout falls into this category, not into PE (see above).

Overall Equipment Effectiveness (OEE)

Given the three examples above, the OEE in this case is 82.213%:

99.653% EA * 83.333% PE * 99.000% QR = 82.213% OEE
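
To make the arithmetic concrete, here is a small C# sketch (assuming a .NET 6+ console app with implicit usings) that derives EA, PE, and QR from the raw measurements in the examples above and multiplies them into OEE. The variable names are illustrative only:

    // Equipment Availability: available minutes versus expected minutes.
    double ea = 1435.0 / 1440.0;                                                   // 99.653%

    // Performance Efficiency: the unused portion of the timeout, floored at 0%.
    double elapsedSeconds = 5.0, timeoutSeconds = 30.0;
    double pe = Math.Max(0.0, (timeoutSeconds - elapsedSeconds) / timeoutSeconds); // 83.333%

    // Quality Rate: successful transactions versus all transactions.
    double qr = 99.0 / (99.0 + 1.0);                                               // 99.000%

    // Overall Equipment Effectiveness is simply the product of the three.
    double oee = ea * pe * qr;                                                     // 82.213%
    Console.WriteLine($"EA {ea:P3}  PE {pe:P3}  QR {qr:P3}  OEE {oee:P3}");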

This boils a service down to one single metric, which can go on a dashboard and present a red/green kind of feedback. It’s not perfect, but what single metric would be? Still, this metric (as well as its three constituents) is easy to comprehend quickly because, in part, it is always a “higher is better” percentage. No matter how you slice it, 0 is bad, 100 is good, and the higher you are, the better you’re doing.

The only caveat is that PE tends to track lower than the others because of how it is measured (and that can skew the OEE number). But even for PE, all of those characteristics are true. I’ll also add that some may take exception to the fact that it’s virtually impossible to get PE to 100% (while, over a short enough period, both EA and QR can reach 100%). But I share the opinion of my high school physics teacher: “If 100% is attainable, then the test is too easy.”

In the end, though, that one magic number is of only so much value. Therefore…

Focus on EA, PE, and QR

Each of these three Service Level Indicators (or SLIs) is highly useful on its own. These, far more than the single OEE number, are the true value of OEE. In the context of SOA, an organization can establish the baseline EA, PE, and QR for each service, then set targets for them (Service Level Objectives, or SLOs). Now the owners of those services have specific targets to drive their prioritization of work. If, as a service owner, you’re “in the green” on both EA and PE, but not hitting your QR number, then you know where to focus your efforts.

Now imagine being a service owner whose service has a QR target that is higher than the QR of a service on which you depend. This leads into an important aspect of OEE: it is meant to be a metric that works across a hierarchy of components. In this case, you have a very real and direct metric that serves as the focal point for a necessary conversation: you’re being given an SLO that you cannot reliably meet, due to the dependency.

Where Do We Start?

Your public-facing interfaces are your starting point. Back to the factory analogy, these are what leave the factory and go to your customers. If a factory has five manufacturing lines, four of them are up, and the fifth one is down, then the factory’s EA is 80% (four at 100% and one at 0%). That’s an oversimplification, but the principles still apply. If you have five public-facing services, four are up, and one is down, your EA is also 80%. So what are your public-facing services? Those are the components of your organization’s OEE. And then you work backwards from there, examining internal services that feed those, each with their own OEE actuals and targets.

How it bubbles up is entirely dependent on how your organization is structured. But the targets are yours to set. If you have two public-facing services, and one of them is 80% of your business, then calculate your overall EA, PE, and QR targets based on that ratio. For that factory with one line down, what if that one line is responsible for half the factory’s output? Then the factory’s EA is 50%, not 80%. In the end, every metric is a percentage of something. How you arrive at the targets for those is simply an extension of the priorities you place on them.
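
As a sketch of that weighting (same assumptions as the earlier C# snippet, with hypothetical service names and the 80/20 split from above):

    // Weight each public-facing service's EA by its share of the business.
    var services = new[]
    {
        (Name: "Storefront", Share: 0.80, Ea: 1.00), // up
        (Name: "Reporting",  Share: 0.20, Ea: 0.00), // down
    };

    // A share-weighted average rather than a simple mean of the two services.
    double overallEa = services.Sum(s => s.Share * s.Ea);
    Console.WriteLine($"Overall EA: {overallEa:P1}"); // 80.0% here; equal shares would give 50.0%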

We typically track many metrics in software, often too many. We’re often flooded with data, and have difficulty prioritizing alerts and our responses to them. I think this approach helps combat that problem, because it is simple, covers a lot of ground, and is easy to scale, easy to understand, and easy to communicate.

Daily Stand-Ups (aka Daily Scrum)

I’m passionate about understanding what makes software engineering teams great, and applying and sharing that understanding however I can. And of the many different ways I’ve seen software built over the years, I’m convinced that Agile and Scrum – if they are applied well – combine to form a highly successful approach to building software. To that end, I have several posts about different aspects of Agile and Scrum, based on my own experience over the years. Here’s one of them.

Scrum.org says, “The purpose of the Daily Scrum is to inspect progress toward the Sprint Goal and adapt the Sprint Backlog as necessary, adjusting the upcoming planned work.” In my experience, this is the method that has worked most effectively:

  1. Do NOT simply go around the room and give status updates. This meeting has a purpose, and that purpose is not served by “What did you do yesterday?” and “What are you doing today?” To rephrase scrum.org, the purpose is answering the question, “Do we need to change our sprint?” That is the question that matters.

Every effective meeting is one of four types: Learn, Decide, Do, or Bond. The Daily Stand-Up, as with most Sprint meetings, is a Decide meeting. Its purpose is to make a decision, and its content should contribute directly to that decision. The single best change you can make to your Daily Stand-Up is to get away from this round-robin format.

We’ve all seen this format. It’s pervasive to the point that it doesn’t even hit our radar anymore. We simply take it for granted that this is how daily stand-ups are supposed to operate. But many of us recognize, instinctively, that there’s something wrong with it. When your instinct is telling you there’s something not quite right, listen to it. Figure out what it’s trying to tell you and make a change.

  2. Limit the meeting (ideally to 15 minutes). This is a daily meeting of the entire scrum team, and has the potential to consume a considerable amount of time over the course of a sprint. This limit can seem artificial if the round-robin anti-pattern is taking place, as time is often taken by those at the head of the line at the expense of those at the tail of it. But once the format is fixed, the time limit is no longer a problem.

Let’s assume a scrum team of six people is meeting daily for 15 minutes for a two-week sprint cycle. Let’s also assume that, for whatever reason, the stand-up is only occurring for eight of the ten days of the sprint, and that each participant requires zero time for context-switching before and after the stand-up (an admittedly false assumption). That is still 12 work hours spent by the team, cumulatively, on the daily stand-up. Twelve hours every sprint. And that’s the ideal. Reality is often far, far more than that.

In one situation, my team was meeting for 45 minutes to an hour every day. It was a drain in many ways, and I attempted at one point to fix it. What I learned the hard way (after half the team reacted negatively) was that the daily stand-up for my geographically diverse team actually served two purposes – Decide and Bond. This was the team’s opportunity to socialize, and I took that away. A better move would have been to separate the two, keeping the daily stand-up efficient while also providing the team with other ways to bond, but external factors meant that did not come to pass before I had moved on.

  3. Address the sprint goals in priority order. For each, ask this question: “Are we still on track to accomplish this goal by the end of the sprint?” If yes, then move on to the next goal (or user story). If no, then decide, then and there, how you will adjust. Do you move it out of the sprint? Do you move a lower priority item out of the sprint? Do you reduce the scope of this item? There are many actions that can be taken, and the only wrong one is to do nothing.

This is what makes the 15-minute time cap work. If you’re starting with the highest priority issue, then your 15 minutes is well-spent. If you get through all of the items, then that’s great! If not, then you still ensure that you’re covering the most important goals while keeping to the schedule.

This depends, of course, on having good prioritization. Many teams fail that step, and then – because they don’t have their work prioritized – they shrug off conducting their daily stand-ups this way. One typical excuse is that all of the individual contributors are “doing their own thing” anyway. If that’s the case, then re-evaluate how you’ve formed your scrum team. If you’re not working together toward common team goals, then you’re not really a team. And as part of that evaluation, I would even consider the possibility that perhaps the scrum format isn’t right for you.

And that’s what it takes for effective Daily Stand-Ups. Remember, above all, that it’s a Decide meeting. Make the appropriate decisions and then get on with your day.

Sprint Reviews

I’m passionate about understanding what makes software engineering teams great, and applying and sharing that understanding however I can. And of the many different ways I’ve seen software built over the years, I’m convinced that Agile and Scrum – if they are applied well – combine to form a highly successful approach to building software. To that end, I have several posts about different aspects of Agile and Scrum, based on my own experience over the years. Here’s one of them.

Scrum.org says, “The purpose of the Sprint Review is to inspect the outcome of the Sprint and determine future adaptations. The Scrum Team presents the results of their work to key stakeholders and progress toward the Product Goal is discussed.” In my experience, this is the method that has worked most effectively:

  1. Review what was accomplished during the Sprint in priority order, for two reasons. First, good meeting discipline is essential to healthy teams. So get to the important stuff first. If you end up spending all of the allotted time on the first item, then so be it. And second, this reinforces the team’s understanding of the priorities. If the team demonstrates Item A, but the stakeholders are more interested in Item B, then that’s a discussion to be had. And if everyone understands that the Sprint Review is in priority order, then the moment the team brings up Item A is when the stakeholders should recognize the problem.

Shortly after the Sprint Review, the team should be entering into Sprint Planning for the next Sprint. And what happens in the former informs what happens in the latter. If you don’t get to the important items in time, then you’re flying blind in Sprint Planning. And if you’re running over in Sprint Review, then you’re likely impacting the schedule for Sprint Planning. In both cases, you’re messing up the next Sprint before it can even get started.

I’ve been a part of plenty of teams for which “everything’s a priority.” And it never works. Being able to properly stack-rank a team’s work is essential. And the Sprint Review is the best place to fix this. You have all the necessary people in the same room together. If you have a prioritization issue, then take this time and fix it. You can ask nicely: “What would you like to see first?” is both subtle and effective. But hold the stakeholders hostage in the meeting if you must.

  2. Have the tester/reviewer for each work item present it. Maybe you have dedicated QA and testing staff, or rely on peer reviews and oversight by other developers, or a mix of the two, or maybe some other solution altogether. Whatever the make-up of your quality control, have the people who fill this role be the ones who present it.

I’ve found this advice to be quite controversial. But I have also found that it raises the team’s effectiveness considerably. Team members learn more from each other. Code quality goes up. The bus factor goes up. Even the simple fact that it calls out that every work item should have someone in that role is a benefit. It’s too easy for teams to slip into bad habits like rubber-stamp PR reviews, and this attacks those bad habits directly.

As the developer in this scenario, I’ve found that one way this practice improves the quality of my work comes from my desire to not embarrass the presenter. It’s one thing for me to shrug off an issue during a demo. It’s another matter entirely to watch someone else look bad in front of everyone due to my mistake. And it also raises my game as the reviewer. If I have to present someone else’s work to stakeholders, I’m going to make sure I thoroughly understand it and have thoroughly vetted it.

  3. Accept feedback but do not make decisions yet. The only decisions that should come out of the Sprint Review are changes in prioritization. The rest is information-gathering for topics to be discussed later (probably in Backlog Refinement meetings).

The stakeholders in a Sprint Review frequently outrank the presenters. In those situations, it’s too easy to “give in” to whatever demands they might make. But such demands should be given careful consideration, and the Sprint Review is not the time and place for that.

I’ve been in situations in which a Sprint Review was a massive eye-opener. What the team presented was not what stakeholders were expecting. It is very easy to get off the rails in these meetings, especially when those same stakeholders begin to panic over direction, resources, and timelines. As with so many issues, the key is prioritization. What needs to happen next? How that happens is a discussion to be had later. It’s very easy in these meetings to focus on how, and not what or why. Be clear about the what and why, and get the question of how off the table.

And that’s all there is to a good Sprint Review. Remember, with a large and diverse group of participants, it’s important to stay focused on just the essential elements – show the stakeholders what has been done, in priority order, solicit feedback, and then settle on any changes to those priorities.

Sprint Retros

I’m passionate about understanding what makes software engineering teams great, and applying and sharing that understanding however I can. And of the many different ways I’ve seen software built over the years, I’m convinced that Agile and Scrum – if they are applied well – combine to form a highly successful approach to building software. To that end, I have several posts about different aspects of Agile and Scrum, based on my own experience over the years. Here’s one of them.

Scrum.org says, “the purpose of the Sprint Retrospective is to plan ways to increase quality and effectiveness.” In my experience, this is the method that has worked most effectively:

  1. Review the outcomes of the previous Retro. Were all follow-up tasks completed? If not, why not? If so, did they have the desired effect? Why or why not? Is there something that should be done differently or in addition to the actions already taken? Is this course of action worth pursuing (at the expense of one or the other of the new work items that could come out of this Retro)? If so, then this becomes the scope of this Retro (and not what is presented below). If not, then document as necessary and move on.

For Retros to be effective, there must be follow-up to ensure they are actually having an effect. If a Retro is nothing more than “shouting into the wind,” then steps need to be taken to address this. This breakdown in the core purpose of the Retro (for that matter, any Retro-related breakdown) must be addressed first, because an effective Retro is the mechanism by which other things change. If the mechanism itself is broken, then it is unreasonable to expect change to occur effectively.

I’ve been a part of a lot of Retros that did not do this effectively. They didn’t revisit what was brought up before. This, more than any other mistake, is what feeds the “nothing ever changes here” mentality. If you don’t accomplish anything, then a Retro is nothing more than a gripe session. And if you do accomplish what you set out to change – when you begin your Retro by checking something off the list – then you’re kicking off a crucial meeting with a collective dopamine hit. What could be better than that?

  2. Through whatever means the team has agreed upon, propose, discuss, debate, and vote on candidate topics. There are a ton of apps out there for doing this. I prefer not to use the “What Went Well? / What Didn’t Go Well? / What Can Be Improved?” style (see this example). I would rather push the “What Went Well?” part of the discussion to a parking lot. It’s technically unnecessary, and having it after the meeting is officially over makes it more voluntary and less forced. And “What Didn’t Go Well? / What Can Be Improved?” are two sides of the same coin. But it’s a popular approach, and I’m not opposed to doing it that way. The important aspect for me – however you do it – is to ultimately arrive at a consensus on two items:
  • An action that the team can take to improve its situation.
  • An issue that cannot be addressed by the team itself, but should be brought to the attention of management.

Like all things in Agile and Scrum, prioritization is key. With effective prioritization, the team can make the trade-offs and other decisions necessary to accomplish its goals. So by prioritizing the one thing the team can change and the one thing it needs from management, the team can make the most effective use of this Retro. All lower priorities can wait until the next Retro. As long as the pattern is followed, all reasonably important items will be addressed in due time.

In my experience, this is greatly helped by every member of the team thinking about this ahead of time (and perhaps even logging their thoughts in some kind of shared document beforehand). Sometimes, the roots of one or both items are obvious even before the meeting begins. But even then, identifying the problem is only part of it. You also need to figure out, as a team, what you’re going to do about it. If everyone puts some thought into this ahead of time, then you’re coming into the meeting with options to discuss and debate.

  3. Set the expectation for the one action to take. Ideally, it becomes a work item to be added to the next Sprint. As with all work items, there should be a clear indication of when the item is done. If its scope appears to extend beyond a Sprint, then identify what smaller portion of it can be done within the Sprint, and ensure that an appropriate follow-up for the remainder of the work is included in the Acceptance Criteria.

This one item does not need to be tackled by the whole team. It is often a work item assigned to just one person. There’s often one or two team members who are especially passionate about whatever the issue might be. Let them run with it. The important part is that it has been decided by the team as a whole that this is the highest priority action that can be taken by the team. It is not decided by one or two individuals, and definitely not by management.

I’ve often had these work items start out as great big plans. For example, one was to rebuild a shared development environment because the current one had deteriorated to a critical state. The scope that was eventually decided upon was to simply identify the steps necessary to make that happen. The Acceptance Criteria of that work item was something along the lines of, “There is a plan.” So by the next Retro, we had an idea of what it would actually take to rebuild the environment, and we could prioritize those steps against the rest of our work and predict when we might have that environment available.

  4. Ensure that the team’s representative to management has a clear understanding of the message to take to them. Have the representative state it to the team in their own words. Ideally, the team should be able to raise sensitive concerns this way, so it is especially important that they are communicated clearly. Bear in mind that even the choice of what to communicate is itself a form of communication. It says, “We care about this issue more than all others right now.”

The team’s expectation should be only that its concern is heard, not that a particular action is taken. It is up to management to decide how to respond, perhaps to take immediate action, provide a more detailed explanation of future plans, or even simply to acknowledge that the concern has been heard without any more context.

When I was that representative, there was a time or two when our VP of Engineering heard our concern, acknowledged its validity, and chose not to take action. He had his reasons, and he did share some of his reasoning at times. But because his position made him privy to information that could not or should not be shared with us, we had to trust his decisions. But trust isn’t a feature to be bolted on at some arbitrary point. It’s an essential aspect of a team’s success, and he’d earned that trust over time (just like we had earned his). So he could say “No” and that was that.

  5. Once these objectives are accomplished, then the Retro is complete. This is regardless of the amount of time taken to complete them. While other Sprint ceremonies should be more strictly time-boxed, the role of the Retro in the overall health of the team and its productivity demands that it be given whatever time is needed.

Good meeting discipline is as important as it is rare. Start meetings on time. Have an agenda. Post that agenda ahead of time. Close the meeting once the objective is accomplished. Do not go over the limit. The Sprint Retro is the “exception that makes the rule.” As I said earlier, if this mechanism is broken, then change doesn’t happen. So, within reason, if the Retro goes over, let it go over.

I was on a team that scheduled 30-minute Retros. We were going through a rough patch, and we went over that limit once or twice. Very quickly, we adjusted the schedule to make them 60-minute meetings so that we would not go over. And then, over time, as we got things under control, we lowered that back down to 30 minutes. I’m happy to say that we even got to the point at which there were a handful of times that we met and decided that there was nothing to improve. We were happy with how things were going. We still found things to complain about, of course, but we decided as a team that none of them were worth tweaking the process, and that it was best to leave it alone.

I miss that team.

Technical Interviews

Because it’s fresh in my mind, I’ll leave this here for candidates looking for software engineering roles. I typically interview for back-end C#/.Net/T-SQL developer roles, so what I have to say here is very much targeted to that. Anyway, in an interview, I generally ask two technical questions:

  1. What are the three methods for adding services in Dependency Injection?
  2. What’s the difference between a clustered and a non-clustered index?

The three methods are AddScoped(), AddSingleton(), and AddTransient(). They govern the scope/lifetime of resources created through DI. If you are capable of answering that, then it can easily lead to a discussion about their differences, when to use each, more general DI topics, etc.
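
For anyone brushing up, here is a minimal sketch of all three registrations in an ASP.NET Core Program.cs; the service and implementation types are hypothetical placeholders:

    var builder = WebApplication.CreateBuilder(args);

    // One instance shared for the lifetime of the application.
    builder.Services.AddSingleton<IClock, SystemClock>();
    // One instance per request (per scope).
    builder.Services.AddScoped<IOrderRepository, OrderRepository>();
    // A new instance every time the service is resolved.
    builder.Services.AddTransient<IEmailSender, SmtpEmailSender>();

    var app = builder.Build();
    app.Run();

    // Placeholder types so the sketch compiles; they stand in for real services.
    public interface IClock { DateTime UtcNow { get; } }
    public class SystemClock : IClock { public DateTime UtcNow => DateTime.UtcNow; }
    public interface IOrderRepository { }
    public class OrderRepository : IOrderRepository { }
    public interface IEmailSender { }
    public class SmtpEmailSender : IEmailSender { }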

The difference between the indexes is that the clustered index is the table (and so there can be only one). It’s how data in the table is organized in storage and in memory. Non-clustered indexes are additional indexes that are stored separately, but have in their leaf nodes the keys of the clustered index. And while there is a limit, the number of them for any given table is more of a design/performance choice. Here, the discussion can head toward B-tree logic, branch versus leaf nodes, key look-up performance, scans versus seeks, etc.

If you’re billed as a senior API developer on the Microsoft stack, then I expect you to know the former. If a senior database developer, then I expect you to know the latter. In my experience, very few candidates know both (and surprisingly few candidates know either). But what really irks me is when someone tries to bluff their way through one of these.

What I would much rather see is someone turn the question around. “Well, I know indexes obviously apply to tables and their use affects performance, but I don’t know the difference. Can you explain it to me?” Show me that you understand your limits and are willing to learn. Tell me what you do know, and let’s work from there. Let’s see how well you expand your knowledge on the fly, under a little pressure. Don’t try to BS me.

Whatever you know about these topics, I guarantee I can find some corner where I know more than you. It might take me a minute to get there, and it may be some esoteric quirk that will almost never come up in the real world, but I’ll get there. My goal is not to show I’m smarter than you (and, in fact, I’m genuinely hoping you’re smarter than me). My goal is to find your edge and pick away at it and see how you react.

Example: I had a candidate recently who was primarily a database developer. He had some .Net skills, but that part of his resume was marginal. But the recruiter brought him to me as a potentially very good candidate for the role because of his database experience. He really had that highlighted on his resume. It was clear that he thought of himself as a go-to for database development.

So I asked the index question. And he BS’d his way through an answer. And then didn’t follow up. Didn’t inquire about the accuracy of his answer. Didn’t probe at all. The interview carried on for a while after that, because I want to be sure of my assessment and not just make a snap judgment. But nothing afterward convinced me that this first impression was incorrect, so we were effectively done right there.

Lesson 1: If you think you’re an expert in an area, then you’d better know what you don’t know. If you don’t think the rabbit hole goes any deeper, then you’re not an expert. I’m very much aware of how little I know of SQL Server in general, and this is coming from someone who is regularly called “the database guy”.

Lesson 2: Failure to answer the question is not failure. Failure to follow up is. Show me that you’re willing and able to learn. Work the problem, right there in the interview. In tech, we will be faced with problems and won’t know the answers. We will be expected to figure them out. So show me that you can figure stuff out. This ends up being more true of junior candidates. They’re early enough in their careers that everything is a learning experience. But what I see is senior candidates with too much swagger, thinking they know all they need to know, and then spending the interview trying to convince me of the same. And those are exactly the kinds of candidates I will not hire.

Lesson 3: If you’re interviewing for a back-end C#/.Net/T-SQL role, go figure out the answers to those two questions. Not just what I wrote, but go for a real understanding of them. Get to the point at which you can answer “Why?” questions about them. If you go through enough interviews, I guarantee they will both come up at some point. I know I’m not the only interviewer who asks them. I stole the former from a colleague, and I know of several others who treat the latter as their go-to.

And good luck in your interviews!

Big Files, ETL Woes, and PowerShell

I have a love/hate relationship with PowerShell. Its role in the grand scheme of my work is an extraordinarily valuable one. But the syntax often drives me nuts. It’s like bash, C#, and old COM-era Visual Basic got smushed together.

But, that rant aside, when I can figure out the syntax, I love the functionality. Here’s one example I’m sharing mostly for when I’ll inevitably need it later myself.

It came from working with enormous text files, and the gotchas that come with them. In this case, I had a 50gb data file that had something wrong with it, buried about 25% in. The file was being processed just fine, until it hit this unexpected data. And because the ETL app was written for performance first, there wasn’t a lot of data validation being done. So it’d just go boom when it hit that batch.

So what was wrong with the file? Well, in order to determine that, I had to see what was in that batch. But you can’t just open a 50gb file in Notepad. Yes, there are other ways to get around this, but here’s the one I chose:

Get-Content {bigFile} | Select-Object -Skip 10000000 -First 20000 | Out-File -FilePath {newFile}

It’s pretty evident what this does. But just to be clear, it skips the first 10 million rows in the file, then writes the next 20,000 rows out to a new file.

Here’s the thing… from my googling, I was halfway expecting this not to work, because it seemed like Get-Content would chew up memory. Was it really unloading rows once it read them? Or was it going to die after eating up all the available memory? As it turns out, it appears to have been doing the former. Slowly, I’ll admit, but performance here was not my biggest concern. I just wanted to get a manageable chunk out of the middle of the file without blowing up memory usage. And that’s what I got.

I was able to track down the errant data – an opening quote with no closing quote – once I had the bad batch isolated.

So, a small win, but a win nonetheless. Which is pretty much on par for my experience with PowerShell. Hope this helps someone else, too.

Unit Testing Dynamic SQL

Dynamic SQL (aka Ad Hoc SQL) is SQL code that is generated at runtime. It’s quite common. Nearly every system I’ve supported in the past 30 years uses it to some degree, some more than others.

It can also be a particularly nasty pain point in a lot of systems. It can be a security vulnerability. It can be difficult to troubleshoot. It can be difficult to document. And it can produce some wickedly bad results.

When I started incorporating T-SQL unit testing into my solutions a few years ago, Dynamic SQL quickly rose to the top of the pile of code to test. A pattern soon emerged, and has gone largely unchanged since then. Here’s how it works:

  1. First, bear in mind that I am adamant about not allowing application code to generate T-SQL code. Stored procedures provide an immensely useful API into the database, and I will enforce the exclusive use of stored procedures as that interface in every situation I possibly can. So this solution naturally builds on that stipulation.
  2. When generating Dynamic SQL in stored procedures, I will create the procedures in pairs. The first procedure calls the second, which generates the Dynamic SQL. That second procedure then returns that result as a string to the first procedure. The first one then executes the string that it receives.
  3. I write my unit test to execute the second procedure only, and then do a simple string comparison to a control string. If the strings match, then the unit test passes.

And that’s it.

Here’s an example, which is part of a painfully over-complicated demo project I put together for my session at the 2022 Summit:

Execute:

CREATE OR ALTER PROCEDURE [so].[Archive_SelectPagedData]
    @ArchiveId BIGINT,
    @Offset BIGINT,
    @Batch BIGINT
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @stmt NVARCHAR(MAX);

    IF (@ArchiveId IS NULL)
            OR NOT EXISTS (SELECT 1
                    FROM [so].[Archive]
                    WHERE [ArchiveId] = @ArchiveId)
        THROW 60001, 'ArchiveId', 0;

    IF NOT EXISTS (SELECT 1 FROM [so].[Archive])
        THROW 60004, 'Archive', 1;

    EXEC [so].[Archive_SelectPagedData_s]
        @ArchiveId = @ArchiveId,
        @Stmt = @stmt OUT;

    EXEC sp_executesql
        @stmt = @stmt,
        @params = N'@offset BIGINT, @batch BIGINT',
        @offset = @Offset,
        @batch = @Batch;
END;
GO

Generate:

CREATE OR ALTER PROCEDURE [so].[Archive_SelectPagedData_s]
    @ArchiveId BIGINT,
    @Stmt NVARCHAR(MAX) OUT
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE
        @columns NVARCHAR(MAX) = N'',
        @fullName NVARCHAR(256),
        @orderBy NVARCHAR(MAX) = N'',
        @tableType INT;

    SELECT
            @fullName = N'[dump].' + QUOTENAME(a.[ForumName] + N'_' + tt.[TableTypeName]),
            @tableType = a.[TableType]
        FROM [so].[Archive] AS a
            INNER JOIN [so].[TableType] AS tt
                ON a.[ArchiveId] = @ArchiveId
                AND a.[TableType] = tt.[TableTypeId];

    SELECT
            @columns += N'
        ' + QUOTENAME(cd.[ColumnName]) + N',',
            @orderBy += CASE cd.[IsPrimaryKey] WHEN 1 THEN QUOTENAME(cd.[ColumnName]) + N',' ELSE N'' END
        FROM [so].[ColumnDefinition] AS cd
        WHERE cd.[TableType] = @tableType
        ORDER BY cd.[ColumnIndex];
    SELECT @columns = LEFT(@columns, LEN(@columns) - 1)
    SELECT @orderBy = LEFT(@orderBy, LEN(@orderBy) - 1)

    SELECT @Stmt = N'/* so.Archive_SelectPagedData */
SELECT{columns}
    FROM {fullName}
    ORDER BY {orderBy}
        OFFSET @offset ROWS FETCH NEXT @batch ROWS ONLY;';

    SELECT @Stmt = REPLACE(@Stmt, N'{fullName}', @fullname);
    SELECT @Stmt = REPLACE(@Stmt, N'{columns}', @columns);
    SELECT @Stmt = REPLACE(@Stmt, N'{orderBy}', @orderBy);

    RETURN 0;
END;
GO

Test:

/* ARRANGE */
SET NOCOUNT ON;
SET XACT_ABORT ON;
DECLARE
    @compare NVARCHAR(MAX),
    @id BIGINT,
    @stmt NVARCHAR(MAX);
SELECT @compare = N'/* so.Archive_SelectPagedData */
SELECT
        [Id],
        [TagName],
        [Count],
        [ExcerptPostId],
        [WikiPostId]
    FROM [dump].[foo_Tags]
    ORDER BY [Id]
        OFFSET @offset ROWS FETCH NEXT @batch ROWS ONLY;';
BEGIN TRANSACTION;
EXEC [so].[Archive_Upsert]
    @ArchiveId = @id OUT,
    @ForumName = N'foo',
    @TableType = 6,
    @SourcePath = N'bar';

/* ACT */
EXEC [so].[Archive_SelectPagedData_s]
    @ArchiveId = @id,
    @Stmt = @stmt OUT;

/* ASSERT */
IF (@compare != ISNULL(@stmt, N''))
BEGIN
    PRINT ISNULL(@stmt, N'(null)');
    THROW 50002, N'so.Archive_SelectPagedData_s_1', 0;
END;
ROLLBACK TRANSACTION;
GO

The first thing to note about this approach is that the strings must 100% match, right down to the white space. That may seem like an unrealistic obstacle at first. But my experience hasn’t supported that assumption. In fact, chasing down the few cases in which tests are failing due to small things like an extra space can often result in the discovery of larger issues that might have otherwise gone unnoticed.

Enforcing such consistency also aids in managing your execution plan reuse. Side Note: Don’t forget to check your Optimize for Ad Hoc Workloads setting!

This approach also aids documentation and troubleshooting. The unit test requires an exact instance of the string being generated. With unit testing like this, you have examples of what the code should be creating right in front of you!

And, if the existing unit tests aren’t enough to identify a problem, then the same pattern can be used while troubleshooting. Once an issue has been identified this way and rectified, then you’re already most of the way to creating a covering test for the new case!

So there ya go… unit testing for your Dynamic SQL. Enjoy!

EF Core and nvarchar(max)

My opinion was once, “Entity Framework is slow.” As I dive more deeply into the nuances of Entity Framework Core, that opinion is changing, but only slightly. It’s becoming, “Entity Framework is slow out of the box.”

There are ways to make it better. I’ll describe one of those here, but first, let’s establish the environment:

  • Entity Framework Core 7.0.5
  • The “code-first” approach
  • SQL Server database (any in-support version)

So what does EF Core do wrong?

In this situation, when you define a string field of a model, the underlying data type will be nvarchar(max). For those who don’t know, this is a variable-length Unicode string with a maximum storage size of 2³¹-1 bytes (2 GB). This is a combination of three choices that EF Core is making:

1. nvarchar(max): The desired encoding (ASCII versus Unicode) is unknown. Therefore, EF Core defaults to the “safer” option, Unicode.

2. nvarchar(max): The preference between a fixed-length string and a variable-length string is unknown. Because fixed-length Unicode strings have a size limit of 4000, while variable-length strings can be larger (see below), EF Core defaults to a variable-length string.

3. nvarchar(max): The desired length is also unknown. Therefore, EF Core defaults to the maximum possible length, “max”.

What are the problems with those decisions?

1. ASCII strings consume half the memory of Unicode strings in SQL Server. If all you need is an ASCII string, then that’s the option to choose.

2. If you have extremely consistent data lengths, then fixed-length columns can be more efficient.

An example of both of these is a phone number. The necessary characters are all available in the ASCII character set. And phone numbers (even accounting for country codes and the nuances within each of those countries) are roughly equivalent in length.

3. Even though the effective length of the data may be very small, SQL Server must assume that the column can hold a large amount of data. Therefore, it makes decisions about resources based on this assumption.

Using the phone number example again, even though every row in the table holds data in this field with a length somewhere between 10 and 20 characters, SQL Server must assume that there is the possibility that it will store up to 2³¹-1 bytes (2 GB).

For an example of how this can affect performance, see Eddie Wuerch’s response in this thread on sqlservercentral.com.

4. Data types like nvarchar(max) are known as large value data types or large object (LOB) data types. Data in these columns is handled differently than other types. For example, that data may be stored on a separate type of page (SQL Server’s designation for an 8k block of storage) than the rest of the row. This is called a LOB page. And this data can’t be indexed like other types, because normal indexes have a size limit.

Why does this matter?

Obviously, these are performance issues, and we should all keep performance in mind as we’re developing our solutions. But the bigger issue is that this behavior can go unnoticed in the early stages of developing an application. It’s only when the application is in production, with a sizable footprint, that we begin to see the harm that this causes. And as the application grows, these issues become harder and harder to fix.

So what do we do about it?

We solve this with two different annotations in our models. In EF Core 7, here is an example of each of these:

    [System.ComponentModel.DataAnnotations.Schema.Column(TypeName = "varchar")]
    [System.ComponentModel.DataAnnotations.MaxLength(50)]
    public string? MyStringField { get; set; }

As you can see, this indicates to EF Core that the string is still meant to hold variable-length data, but that it uses the ASCII data type and has a limit of 50 characters.
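
If you would rather keep the model free of attributes, my understanding is that the fluent API can express the same mapping in OnModelCreating. Here is a sketch under that assumption, with a hypothetical entity and DbContext:

    using Microsoft.EntityFrameworkCore;

    public class MyEntity
    {
        public int Id { get; set; }
        public string? MyStringField { get; set; }
    }

    public class AppDbContext : DbContext
    {
        public DbSet<MyEntity> MyEntities => Set<MyEntity>();

        protected override void OnModelCreating(ModelBuilder modelBuilder)
        {
            // Same outcome as the annotations above: varchar(50) instead of nvarchar(max).
            modelBuilder.Entity<MyEntity>()
                .Property(e => e.MyStringField)
                .IsUnicode(false)   // map to varchar rather than nvarchar
                .HasMaxLength(50);  // cap the column at 50 characters
        }
    }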

Summary

Once again, EF Core is not necessarily slow, but it is slow by default. And going along with the default behavior when it is the most inefficient option is a recipe for disaster. This is why many people, myself included, are inclined not to use it. But it is a popular ORM, so knowing how to make it better is vital.

The Case for Stored Procedures

In some parts of the C# community, I’ve noticed that there seems to be a desire to avoid writing SQL code. Some advocates for Entity Framework have even cited the ability to avoid writing SQL as one of its benefits. I admit that I am no EF expert. If I had to choose an ORM, my list stops at Dapper, and I normally roll my own data access layer.

But I’m not here to try to dissuade anyone from using Entity Framework. What I’d like to do, though, is lay out some of the benefits of embracing SQL in your solutions, specifically Transact-SQL and the use of stored procedures in SQL Server.

First let me share my typical design for a new solution based on a relational database. I not only use a database-first approach, but I build the database using scripts that are part of a repository. Everything after the CREATE DATABASE statement itself is in the repo. Each script is essentially idempotent. In other words, the script can be executed repeatedly without adverse effect. For tables, this means checking for existence before creating them. For changes to existing tables, additional logic is used to (for example) check for the existence of a new column before adding it. For other objects, this is mostly just using the “CREATE OR ALTER” syntax. I also take steps to maintain backward compatibility, but that’s for another post.

By the way, kudos to the attendee at my PASS Summit 2022 session who called me out for using the word idempotent. I’ve been throwing that word out for years, knowing that it’s not quite what I mean, but close enough. So, no, the scripts aren’t really idempotent, but I haven’t found a better word yet for, “I can F5 them to death.”

Anyway, I also include PowerShell scripts in the repo that execute the T-SQL scripts on a database, and incorporate these into a CI/CD pipeline. I won’t explain the whole pipeline, but one crucial part is that a merge of an approved PR into the dev branch will trigger a deployment to a reference development environment. The scripts are all executed against the database(s) in this environment, and then application code is deployed to the servers.

The point I’m trying to make is that the database code is treated as part of the solution. It is in the repos just like the C# code. It is deployed just like the C# code. I even include unit testing, just like the C# code.

I also keep all T-SQL in the database. There is no T-SQL code inside the C# code. Stored procedures are used for all functionality, even basic CRUD functions. In fact, probably 80–90% of the stored procedures in my systems are essentially CRUD functions. The only operation that doesn’t go through stored procedures is bulk insertion, and I use SqlBulkCopy for that.
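
To show the shape of that interface from the application side, here is a hedged sketch of a repository method calling one of those CRUD procedures through plain ADO.NET (assuming Microsoft.Data.SqlClient and .NET 6+ implicit usings); the procedure name and result column are hypothetical, not taken from the demo project linked below:

    using System.Data;
    using Microsoft.Data.SqlClient;

    // All reads and writes go through stored procedures; the application never
    // builds T-SQL strings. The procedure and column names are hypothetical.
    public sealed class ArchiveRepository
    {
        private readonly string _connectionString;

        public ArchiveRepository(string connectionString) => _connectionString = connectionString;

        public async Task<string?> GetForumNameAsync(long archiveId)
        {
            await using var connection = new SqlConnection(_connectionString);
            await using var command = new SqlCommand("[so].[Archive_SelectById]", connection)
            {
                CommandType = CommandType.StoredProcedure
            };
            command.Parameters.Add("@ArchiveId", SqlDbType.BigInt).Value = archiveId;

            await connection.OpenAsync();
            await using var reader = await command.ExecuteReaderAsync();
            return await reader.ReadAsync() ? reader["ForumName"] as string : null;
        }
    }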

Here is an example of my approach, one that I put together for that Summit session: https://dev.azure.com/downshiftdata/_git/SearchOverflow

Why Do It This Way?

  1. Changes to the database are tracked in the repo just like changes to application code. So knowing who changed what, when they changed it, who reviewed it, when it got deployed, etc, is no more difficult to answer for database changes than application code changes.
  2. The stored procedures act as an interface layer, keeping database code in the database and application code in the application. Radical changes to the database are possible, without any effect on application code and without incurring any downtime. Performance issues in the database can be addressed quickly, through hotfixes, without redeploying the application. And the database design can take advantage of this interface layer. For example, instead of relying on AFTER triggers, audit logging can be done through OUTPUT clauses inside the stored procedures. This is because you can reasonably guarantee that all the writes are coming in through the procedures (and even support this assumption with appropriate authorization configuration).
  3. Stored procedures are compiled, with their plans in the procedure plan cache. Yes, this happens with ad hoc queries as well (and I’m an advocate for the “Optimize for Ad Hoc Workloads” option in SQL Server). But there are issues like parameter sniffing and cache bloat that need to be handled. Routing execution through stored procedures makes it easier to manage the cache and understand what is happening in the database.

The Other Side

I’ve seen quite a few arguments against approaches like this. Here are some of the more popular ones:

  1. If I use Entity Framework and take a code-first approach, then it does all the SQL for me and I don’t have to worry about it.

Once again, I’m not an EF expert, and don’t know the full scope of what it can and can’t do. But I’ve heard variations of this one quite a lot. And the argument falls apart like any other code-generation argument. It’ll work fine 90% of the time, and then you’ll be in trouble when you hit the other 10%. Now, in some situations, this may work out. You’ll always be inside that 90%. But to me, it’s a calculated risk, and one that I don’t have to take, because this isn’t the approach I use.

Where this one concerns me is in situations in which a DBA (or someone in a similar role) was essentially ignored. The DBA advocated for something less wizard-y, and had good reasons for it. But they’re often outnumbered and – frankly – not well respected, and decisions are made despite their arguments. And then the situation hits that 10% and the DBA’s list of options is limited. They often know the change they should make, but they can’t make it because that code is being generated in the application layer. Or the change they want to make will break application layer code.

  2. That’s the old way of doing things.

This makes me cringe every time. Yeah, we all wrote stored procedures twenty years ago, but that’s not how you do things today. This goes hand-in-hand with the idea that RDBMSs themselves are antiques and shouldn’t be used. I’m a firm believer in the KISS principle. And despite being ancient in the tech world, the RDBMS is a straightforward solution to a lot of problems. Would a big data store be better for a certain situation? Of course. Will your app ever reach a threshold that demands the advantages of that big data store? If not, then why add to the complexity of your solution? If your tech stack includes an RDBMS, then you have one product that can reasonably support a host of requirements.

  3. The database is the database and the application code is the application code and the two shouldn’t be mixed.

In my experience, those who make this case have had to suffer through thousand-line stored procedures that tried to do everything in the database. That is indeed a nightmare situation, and one that I’ve seen all too often.

But their response is to swing fully in the other direction. Don’t let any code reside in the database. But the truth is that there are scenarios in which it makes sense to go either direction. If you need to join several large tables, have good covering indexes, need a timely response, are faced with frequent cache invalidation, and only need a few rows in each call, it makes a mountain of sense to have that functionality in a stored procedure. Yes, that could be dynamically generated in the application layer. But there is definitely a non-zero chance of getting a bad execution plan in that scenario. And as I said before, your options then are much more limited.

Conclusion

Stored procedures form an interface layer, between the database and the application code. This layer of abstraction is incredibly beneficial to application security, maintainability, and scalability. Not using it is a design decision that should be considered very carefully.

When I am in conversations with others in the database community, or with coworkers who share some of my experience, I wouldn’t dream of having to make these arguments. But now that I’m back in C# land more than I used to be, I’m shocked at how often my opinions are at odds with others in this community. So, I don’t know if I’ve put any of those arguments to rest – probably not – but maybe that anti-stored-procedure crowd is at least thinking a bit more about the situation.

Back

Before COVID hit the United States in 2020…

  • I was a software engineer at an Indianapolis-based SaaS start-up in the human resources sector.
  • I was a soccer coach.
  • I had spoken at 22 different PASS-affiliated events over the previous four years.
  • I was a Program Manager for PASS.
  • PASS existed.

And now…

  • I’m still a software engineer, but for a different start-up, one that deals with AI in the defense industry. It’s based out of Austin.
  • After fourteen years, I’ve put coaching on an indefinite hiatus.
  • I haven’t spoken at any event.
  • PASS, as it was, is no more.

But PASS is back! SQL Saturdays are coming back! Redgate seems to be doing an admirable job of picking up the pieces, and carrying the torch forward. And, thanks to them, the Summit is back next week! And I’m speaking at it!

It’s a session entitled “Unit Testing T-SQL”. Arguably, I take a bit of a wandering path through DevOps to get there, but – in the end – I do indeed bring the discussion around to a way to perform unit testing in Transact-SQL.

What happens after this? I honestly don’t know. One reason I put aside coaching was to focus on my work more. I want to take that part of my brain that is focused on teaching and supporting others and move it in from the soccer field. So now it’s time to figure out what to do with it.

Also, one of these days, I finally need to actually go to Austin. Everyone tells me it’s nice there.