Big Files, ETL Woes, and PowerShell

I have a love/hate relationship with PowerShell. Its role in the grand scheme of my work is an extraordinarily valuable one. But the syntax often drives me nuts. It's like bash, C#, and old COM-era Visual Basic got smushed together.

But, that rant aside, when I can figure out the syntax, I love the functionality. Here's one example I'm sharing mostly so I can find it again when I inevitably need it myself.

It came from working with enormous text files and the gotchas that come with them. In this case, I had a 50 GB data file with something wrong in it, buried about 25% of the way in. The file was processing just fine until it hit that unexpected data. And because the ETL app was written for performance first, it wasn't doing much data validation, so it would just go boom when it hit that batch.

So what was wrong with the file? Well, to determine that, I had to see what was in that batch. But you can't just open a 50 GB file in Notepad. Yes, there are other ways around this, but here's the one I chose:

Get-Content {bigFile} | Select-Object -Skip 10000000 -First 20000 | Out-File -FilePath {newFile}

It's pretty evident what this does. But just to be clear: it skips the first 10 million lines of the file, then writes the next 20,000 lines out to a new file.

Here's the thing: from my googling, I was halfway expecting this not to work, because it seemed like Get-Content would chew up memory. Was it really releasing lines once it had read them, or was it going to die after eating all the available memory? As it turns out, it appears to have been doing the former. Slowly, I'll admit, but performance was not my biggest concern here. I just wanted to get a manageable chunk out of the middle of the file without blowing up memory usage, and that's what I got.
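If you'd rather pull the slice in .NET, here's a minimal C# sketch of the same idea, assuming the file is plain line-delimited text (the paths below are hypothetical stand-ins; the line counts match the PowerShell example above). A StreamReader reads one line at a time, so memory stays flat no matter how big the file is:

using System.IO;

class FileSlice
{
  static void Main()
  {
    // Hypothetical paths – substitute your own input and output files.
    const string bigFile = @"C:\data\bigFile.txt";
    const string newFile = @"C:\data\slice.txt";

    using (var reader = new StreamReader(bigFile))
    using (var writer = new StreamWriter(newFile))
    {
      // Skip the first 10 million lines (stopping early at end of file).
      for (int skipped = 0; skipped < 10000000 && reader.ReadLine() != null; skipped++) { }

      // Copy the next 20,000 lines to the new file.
      for (int copied = 0; copied < 20000; copied++)
      {
        string line = reader.ReadLine();
        if (line == null) break; // ran out of file
        writer.WriteLine(line);
      }
    }
  }
}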

I was able to track down the errant data – an opening quote with no closing quote – once I had the bad batch isolated.

So, a small win, but a win nonetheless. Which is pretty much on par for my experience with PowerShell. Hope this helps someone else, too.

Tigers in the Rain

This is the first in what may become a series of anecdotal posts about what I’ve learned as a coach and how I think it translates to becoming a good manager…

After the match, my brother told me he knew how it was going to go when he saw them in warm-ups. The other team, he said, looked like they didn’t want to be there, shuffling around half-heartedly in the drizzle. In contrast, our girls were all smiles.

They’d been there before.

During the previous season, the TC Tigers hosted Shelbyville, a larger nearby school and a fierce rival. The rain came down in buckets that night. It was their last season on the grass, and the field held up surprisingly well. So did the girls, who beat the Golden Bears 3-1.

I'd never seen a match played in those conditions… until that night at the 2017 Sectional championship. At one point, the officials had to halt the match on account of lightning, and the teams were sent off to the locker rooms to wait it out. But the match eventually resumed, and so did the dogfight. Heritage Christian may have come into the game as heavy favorites, but our girls held their own, keeping the score 0-0 through regulation and two overtimes.

Then they won the shootout. And the hardware.

When he shared his observation with me afterward, I realized the team had achieved something else as well. Not only could they play well in the rain, they believed they could play well in the rain. And I also realized that it’s up to me to keep that belief alive.

We’re preparing them for a new season now. Only seven of the girls on the current roster were part of that Shelbyville game. That doesn’t matter, though. I’ve brought up those rain matches a couple of times in the off-season, when I’ve had a mix of those who were there and those who weren’t. I’d get them going, and then let the veterans reminisce about what it was like to play in those matches and to come out on top. And let the rookies soak it in.

The girls are going to win some and lose some. That’s the nature of the game. But if it rains during a match this season… well, I almost feel sorry for the other team. Are they that much better than every other team in the rain? Maybe not. But they believe they are that much better – even the girls who weren’t at those two matches.

A development team won’t ever be asked to write code outside in the rain. But they’ll have their own rain matches. A production outage, a performance issue, an angry customer… who knows what the situation will be? But there will inevitably be something – some bugaboo that they’ll be able to overcome once or twice. It’s up to the manager to spot those successes and capitalize on them. It’s up to the manager to nudge the team toward embracing them and allowing them to become part of its identity. It might take nothing more than a “Remember when…” at a team lunch. Whatever it takes, though, the important thing is to gently – imperceptibly – encourage them to believe in themselves, and in each other. Especially when the rain comes.

tsqllint

My local PASS group, IndyPASS, has its monthly meeting tonight. At my insistence, first-time presenter Nathan Boyd is showing off a SQL tool called tsqllint. Nathan, a coworker of mine at Salesforce, is the lead developer on this GitHub project.

A lint (or linter), if you didn't know, "analyzes source code to flag programming errors, bugs, stylistic errors, and suspicious constructs" (Wikipedia). This one is designed specifically for T-SQL, is highly configurable, and includes a Visual Studio Code extension. What more could you want, right? If you want cleaner T-SQL code out of your developers, with less hassle for your reviewers, it's definitely worth your time.

If you’re in the area, keep in mind there’s a location change tonight. While IndyPASS usually meets at Virtusa, 1401 North Meridian (formerly Apparatus), this month’s meeting is at Moser Consulting in Castleton. As usual, doors open at 5:30pm, and we’ll turn it over to Nathan by about 6:15pm.

T-SQL Tuesday #104

My thanks to Bert Wagner and his chosen topic for T-SQL Tuesday, Code You Would Hate To Live Without. It was just enough of an excuse to dust off the cobwebs here and get back to posting.

Anyway, since half of my time is spent in C#, I thought I'd venture into that world for my response. I'll share a couple of common extension methods that I include in most of my projects. Extension methods, as the name implies, extend the functionality of existing types. Here is a code snippet with a couple of extensions I typically add:

namespace myproj.Extension
{
  public static class Extensions
  {
    public static bool In<T>(this T val, params T[] values) where T : struct
    {
      return ((System.Collections.Generic.IList<T>)values).Contains(val);
    }

    public static object ToDbNull(this object val)
    {
      return val ?? System.DBNull.Value;
    }

    public static object FromDbNull(this object val)
    {
      return val == System.DBNull.Value ? null : val;
    }
  }
}

The first method lets me easily check whether a value appears in a given set, which is especially handy with enumerations. For example, if I've defined this enumeration:

namespace myRacingProject.Enum
{
  public enum Series
  {
    None = 0,
    Indycar = 1,
    IndyLights = 2,
    ProMazda = 3,
    Usf2000 = 4
  }
}

Then I could use the extension like this:

if (mySeries.In(Enum.Series.ProMazda, Enum.Series.Usf2000)) myChassis = "Tatuus";

As for the other two methods, well… When is a null not a null? When it’s a System.DBNull.Value, of course! SQL Server pros who have spent any time in the .NET Framework will recognize this awkwardness:

var p = new System.Data.SqlClient.SqlParameter("@myParam", System.Data.SqlDbType.Int);
p.Value = (object)myVar ?? System.DBNull.Value;

With the extension, the second line becomes:

p.Value = myVar.ToDbNull();

Similarly, when reading, this:

var myInt = (int?)(myDataRow[myIndex] == System.DBNull.Value ? null : myDataRow[myIndex]);

Becomes this:

var myInt = (int?)myDataRow[myIndex].FromDbNull();

They’re not earth-shattering improvements, but my real point is that extensions are an often-overlooked feature that can improve your quality of life as a developer. Anytime you find yourself writing the same bit of code over and over, especially if that bit is rather unsightly, you might consider making it an extension.
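For instance (a purely hypothetical extension, not something from the snippet above), the cast-plus-DBNull-check pattern for reading could itself be folded into a generic extension:

namespace myproj.Extension
{
  public static class DataRowExtensions
  {
    // Hypothetical helper: wraps the indexer, the DBNull check, and
    // the cast into one call for any value type.
    public static T? GetNullable<T>(this System.Data.DataRow row, int index) where T : struct
    {
      return row[index] == System.DBNull.Value ? (T?)null : (T)row[index];
    }
  }
}

With that, the read above shrinks to var myInt = myDataRow.GetNullable<int>(myIndex);.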

Want to know more? Here ya go: https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/extension-methods

T-SQL Tuesday, Late to the DevOps Party

This is what I get for slacking off on my blogging…


So, I totally missed this month's T-SQL Tuesday. Which is a shame, because it's an area I have a lot to talk about. Grant Fritchey hosted T-SQL Tuesday #91, which is all about DevOps. There were a lot of good posts. I liked Rob Farley's comment: "DevOps is the idea of having a development story which improves operations." Andy Yun had a nice take, closing with: "This is what I believe DevOps is all about. The tools and processes being pioneered today help all of us build better, more stable software, which is better for all of us."

My own personal experience with what I think of as DevOps took a big leap about a year and a half ago when I moved from a product group in my organization to a development-oriented team inside our operations group. That transition both confirmed and challenged an observation I’d made years ago.

My observation was that all organizations sit somewhere on a track between two extremes. At one extreme is the legendary "two guys in a garage" story. At the other is a Fortune 500 conglomerate doing business in tightly regulated industries. In that garage, Woz was free to design and innovate at will, and what he produced is now the stuff of legend. Apple, on the other hand, delivers a new product only after a monumental expense of time and resources. They still produce remarkable products – one cannot argue with the massive success of the iPhone and its successors. But how those products come to market is a far cry from that Homebrew Computer Club meeting in '76.

The critical piece of this observation is that all organizations move from the former extreme toward the latter. They may move at different rates, even coming to a stop for any length of time, but they never go the other direction. Let me repeat that: they never go the other direction.

Let’s think about database backups as an example. Garage organizations may not do any backups at all. Then something gets lost, and perhaps a weekly full backup task is scheduled. The system grows and taking a simple full backup once a week no longer scales properly, so a better schedule is created, with more particulars about what is backed up, when, and how. This process keeps getting refined, until, one day, there is a standard procedure for performing backups, including off-site storage, regular testing of the restoration process, and all kinds of other aspects that operations people in Fortune 500 companies have to think about.

Many of us are quick to claim that the Garage is better. We love the folklore around how Apple (and other companies like it) got started. We love to reminisce about those times that we turned caffeine into code and all was right with the world. It’s just something about being a coder. But that’s not universally true. Would your organization be better off without a good backup restoration procedure?

The problem is that – good or bad – the processes pile on. Bad or obsolete procedures never get removed, just marginalized until they become an indistinguishable part of the Fortune 500’s ecosystem.

I help administer a database that does not need to be backed up. Yes, that’s correct. It does not need to be backed up. Ever. It’s a data warehouse, of sorts, that is populated by a convoluted ETL process originating from log files. Due to both size and the diminishing value of old data, we only keep data in this database for a short time. If we were to lose the database, we would recreate the structures, and then “re-dump” the log files into the ETL source folder, and let the process churn through them again. That method of recovery would only take marginally more time than restoring from backup. By the way, for this database, the backup process itself is a serious drain on system resources.

Getting our operations group to STOP backing up this database was an adventure, because of all of the procedures in place to ensure that every database was properly backed up. At one point, because it was causing a production issue, I disabled the backup job (thereby curing the immediate issue). For my quick thinking, I was rewarded with having my SQL Agent rights revoked.

Anyway… the moral of my story is that the situation gets more complex over time. Once again, it never gets simpler. At least, that was my theory.

Enter DevOps.

To me, this is precisely the point of DevOps: the ability to go backward on this track.

Now, here's the hard part. To do this – to move toward the Garage end of the track – takes people who understand what is important and what isn't. You can't trim excess process on a whim. You have to know what should get trimmed and what shouldn't. Because you're removing processes that protect the organization and limit risk, you need smart, organized people who will do the right thing without a process directing them.

DevOps is all about the people. It takes good communication. In his T-SQL Tuesday post, John Morehouse emphasized it with his call to “Start talking.” David Alcock joked about Dev and Ops going to “couples therapy” (and did indeed make a good point). But it’s not just good communication. It’s good people. For DevOps to work, DevOps people must be smart about what they do and don’t do. And then, if they’re successful, they’ll do what I didn’t think was possible – move from the Fortune 500 end of that track back towards the Garage.

As long as they don’t get their SQL Agent rights revoked in the process.

Threading

A while back, I wrote an app that spawned a collection of threads to run some work in parallel, using the resources in the System.Threading namespace of the .NET Framework. Some time after that, I worked on another app that also had a threading component. This second app was reviewed by another developer from outside my immediate circle. He asked, “Why didn’t you use the System.Threading.Tasks” namespace? Uhh… because I didn’t know it existed?

That namespace was introduced in .NET Framework 4 – not exactly recent history – but I had somehow missed it for quite a long time. There are a few causes for that, but the one I’d like to focus on here is a trap that I think catches many developers at one time or another: We think we have it all figured out. While we are, to some degree, practical mathematicians – professionals who assemble algorithms to meet requirements – we are also creators. Our code is our art. And oftentimes, we don’t have the humility necessary to accept the possibility that our art isn’t beautiful. So we shy away from having the right people check our work.

This reminds me of an old saying: If you’re the smartest person in the room, then you’re in the wrong room.*

Now, this is not a commentary on my current team. I work with some really smart people, and I'm very grateful for that. But while my teammate may be one of the best PHP or Node.js coders I know, that doesn't necessarily translate to expertise with the .NET Framework. The true test is this: no matter how smart they are, if they're not catching my mistakes, then I'm not being held accountable.

Lesson 1: Make sure someone’s catching your mistakes. If they’re not, then do you really think the reason is that you’re not making any?

So, back to the two apps… After the other developer’s feedback, I reworked the second one prior to release, and it passed its code reviews. The first app, meanwhile, developed some bad behavior in production. There was definitely a race condition of some sort, but I couldn’t seem to nail down where it was. I made a couple of adjustments to the code, but nothing seemed to bite. Of course, I couldn’t reproduce it in testing either.

Finally, I ripped out the threading code entirely and replaced it with nearly identical code based on System.Threading.Tasks. I was concerned about the risk of introducing more bugs, about the fact that I was still unable to reproduce the problem, and about how long it had been a problem, so I tried to remain as faithful to the original design as possible. And, yeah, honestly, I crossed my fingers.
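For illustration, here's a minimal sketch of the shape of that change – the work items and method names are hypothetical stand-ins, not the actual app:

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThreadingDemo
{
  static void DoWork(int item)
  {
    Console.WriteLine("Processed " + item);
  }

  static void Main()
  {
    int[] items = Enumerable.Range(1, 10).ToArray();

    // Before: spawn, start, and join System.Threading threads by hand.
    var threads = items.Select(i => new Thread(() => DoWork(i))).ToArray();
    foreach (var t in threads) t.Start();
    foreach (var t in threads) t.Join();

    // After: nearly identical shape, but the Task machinery handles the
    // scheduling, pooling, and exception propagation.
    var tasks = items.Select(i => Task.Run(() => DoWork(i))).ToArray();
    Task.WaitAll(tasks);
  }
}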

Once this new version was released, the problem was gone.

Lesson 2: System.Threading.Tasks really is better than System.Threading.

I’ll never know what exactly fixed the problem. I could keep researching it, but the costs to me for that aren’t quite worth the benefits at this point. My takeaway was that the new stuff just simply works better. Whether that’s because it’s easier to use the right way (and harder to use the wrong way) or its internals are less buggy or some combination thereof, the end result is the same. I hope that’s old news to anyone reading this, but I wanted to share my experience just in case.

* I was unable to identify with certainty the source of this phrase. The leading candidate I found was 1962 Nobel Laureate James Watson.

A Christmas Wishlist

I finally had the chance to catch up on some reading last week, and it got me thinking about sharing this list. If you are in software development, then I would consider these four books to be required reading. I’m going to revisit this post in the future, because I’m sure there are more to add. But I wanted to start somewhere, and I’m confident about these four.

Fonts and Frustration

TL;DR – There are a couple of XML files at the end of this post. If you regularly present technical material using SSMS, download these.

I present technical sessions now and then – my local PASS group, SQL Saturdays, internal groups at my workplace, etc. I frequently find myself adjusting the fonts inside SQL Server Management Studio to make sure my material is readable on the big screen. I’ve also been in the audience plenty of times, watching with sympathy as one of my cohorts agonizingly navigates this problem.

Usually, it goes something like this. They first find the [100%] tucked away in the lower left corner of the text window, and blow that up to 150 or 200 percent. Then they run their query to find that the results are still at 100%. So then they eventually find the Options dialog under the Tools menu, find the Fonts and Colors branch of the tree, and then groan when they realize they have to figure out which three or four of the 30 different fonts they need to change. Sometimes, they’ll give up there and just go use ZoomIt (which any good technical presenter should have available anyway), but constantly bouncing around with ZoomIt will get old quickly over the course of an hour-long session.

But if they do manage to find the right fonts to change and take a good stab at what they ought to be, they get this wonderful message:

[Image: Font Frustration – the warning dialog SSMS throws after changing these fonts]

Just the thing you want to see when you already have all your demo scripts loaded, right?

Oh, and don’t forget that – when the session is over – you now have to go through the same exercise to get SSMS back where you had it before the session.

So quite a while ago, I generated a couple of .reg files for myself, one called PresentationFonts.reg and one called NormalFonts.reg. You can imagine what these did when I applied them to the Windows Registry.

That worked great… until recently. The SQL Server Tools team has done some marvelous things with SSMS lately, and I’m very happy with the changes. But take a close look at one of those things they did:

[Image: Version Information – SSMS's version details, revealing the new Visual Studio shell]

And where does this new shell keep its settings? Here’s a hint – it’s not in the registry. It’s actually in this file:


{LocalApplicationDataPath}\Microsoft\SQL Server Management Studio\13.0\ApplicationPrivateSettings

And this file is some bizarre hybrid of XML, JSON (with JSON inside of JSON, no less!), and I don’t know what else.

Fortunately, there is an option available. Under the Tools menu, there is "Import and Export Settings…", which gives you a wizard for importing some or all settings from an XML file. So, with that in mind, here are the files I use:

  • PresentationFonts.vssettings – This changes the font size to 16 for Text Editor, Execution Plan, Grid Results and Text Results.
  • NormalFonts.vssettings – This changes the font size to 9 for Text Editor, Execution Plan, Grid Results and Text Results.

NOTE: When you save these, save them with the .vssettings extension. Since I’m a cheapskate and use wordpress.com to host this blog, I’m prevented from using whatever extension I want. So they’ll show up as .doc files in your download dialog, but they really are just text XML. And the Import/Export wizard looks specifically for .vssettings files.

Obviously, you may not use the same settings I do, and you’ll have to customize them for your own uses. If you change the same four that I do, then all you have to do is fiddle with the sizes in the files. If you wish to change different fonts, you’ll want to export your settings with that wizard, change the font you want, export them again, and compare the files in order to figure out which GUID is which.

In any case, I strongly recommend having a pair of files like these parked right next to your demo scripts, where you’ll remember to run them as you prepare for your session.

I know this is a rather long post for a rather small tip, but I’m amazed at just how many of us fight this problem. If I had a dollar for every time I’ve seen a presenter struggle with font sizes, my Azure subscription would be paid for.

And So We Meet Again

This is not a technical topic, but one I find myself very passionate about, and an article today at NPR brought it to the forefront. As my organization matures, like every organization, rampant movement is slowly replaced by slower, more controlled movement. Notice I used the word "movement". As Hemingway said, "Never mistake motion for action." What maturity often brings an organization is a better ratio of action to motion. We do a better job of working only on those things that really matter, not going off on adventures that may never make their way to production, or developing with a "ready, fire, aim" approach. But it also wraps that action in more red tape. It's a trade-off that every organization encounters.

Anyway, meetings are a highly visible part of that red tape. And the article covers the topic pretty well. I just have one thing to add. It’s a little math that I’ve always kept in the back of my mind at every meeting. Here’s the formula:

Px = T × W

Or: Productivity lost (Px) = the Time of the meeting (T) × the number of Workers at the meeting (W).

If you are running a meeting for 6 knowledge workers for 2 hours, assume that you just lost 12 hours of productivity. Was that meeting worth those 12 hours? It might have been. But that’s the trade-off. Remember, everyone’s time is valuable.

So, if you have any control over meetings at your organization, please keep that formula in mind. I always do.