Big Files, ETL Woes, and PowerShell

I have a love/hate relationship with PowerShell. Its role in the grand scheme of my work is an extraordinarily valuable one. But the syntax often drives me nuts. It’s like bash, C#, and old COM-era Visual Basic got smushed together.

But, that rant aside, when I can figure out the syntax, I love the functionality. Here’s one example I’m sharing mostly for when I’ll inevitably need it later myself.

It came from working with enormous text files, and the gotchas that come with them. In this case, I had a 50 GB data file that had something wrong with it, buried about 25% of the way in. The file was being processed just fine until it hit this unexpected data. And because the ETL app was written for performance first, there wasn’t a lot of data validation being done. So it’d just go boom when it hit that batch.

So what was wrong with the file? Well, in order to determine that, I had to see what was in that batch. But you can’t just open a 50 GB file in Notepad. Yes, there are other ways to get around this, but here’s the one I chose:

Get-Content {bigFile} | Select-Object -Skip 10000000 -First 20000 | Out-File -FilePath {newFile}

It’s pretty evident what this does. But just to be clear, it skips the first 10 million rows in the file, then writes the next 20,000 rows out to a new file.
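If you want to reuse that one-liner, here’s a minimal sketch of it wrapped in a function. The name Export-LineRange and its parameter names are made up for illustration; they aren’t built in. One assumption worth flagging: in Windows PowerShell 5.1, Out-File writes UTF-16 by default, which roughly doubles the size of a plain ASCII extract, so the sketch asks for UTF-8 explicitly.

```powershell
# Hypothetical helper: copy a window of lines from a large text file to a new file.
function Export-LineRange {
    param(
        [Parameter(Mandatory)] [string] $Path,         # the big source file
        [Parameter(Mandatory)] [string] $Destination,  # where the extract goes
        [Parameter(Mandatory)] [int]    $Skip,         # lines to skip from the top
        [Parameter(Mandatory)] [int]    $First         # lines to keep after skipping
    )

    # Get-Content emits one line at a time, so the whole file is never held in memory.
    # Select-Object -First stops the upstream command once it has enough lines
    # (PowerShell 3.0 and later), so the rest of the file is not read.
    Get-Content -LiteralPath $Path |
        Select-Object -Skip $Skip -First $First |
        Out-File -LiteralPath $Destination -Encoding utf8
}
```

With the numbers from above, the call would be Export-LineRange -Path {bigFile} -Destination {newFile} -Skip 10000000 -First 20000.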

Here’s the thing… from my googling, I was halfway expecting this not to work, because it seemed like Get-Content would chew up memory. Was it really unloading rows once it read them? Or was it going to die after eating up all the available memory? As it turns out, it appears to have been doing the former. Slowly, I’ll admit, but performance here was not my biggest concern. I just wanted to get a manageable chunk out of the middle of the file without blowing up memory usage. And that’s what I got.
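If the speed does become a problem, one alternative is to skip Get-Content and read the file with a .NET StreamReader, which also holds only one line in memory at a time and is generally reported to be faster. This is a sketch under that assumption, not what was run against the 50 GB file, and Get-LineRange is a hypothetical name.

```powershell
# Hypothetical helper: stream a window of lines using System.IO.StreamReader.
function Get-LineRange {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [long]   $Skip,
        [Parameter(Mandatory)] [long]   $First
    )

    # .NET resolves relative paths against the process working directory, not the
    # current PowerShell location, so resolve to a full path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $reader = New-Object System.IO.StreamReader -ArgumentList $fullPath
    try {
        $index = 0L
        $end = $Skip + $First
        # ReadLine returns $null at end of file.
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($index -ge $end) { break }      # past the window: stop reading
            if ($index -ge $Skip) { $line }     # inside the window: emit the line
            $index++
        }
    }
    finally {
        $reader.Dispose()                       # always release the file handle
    }
}
```

The output can be piped to Out-File or Set-Content exactly like the Get-Content version, for example Get-LineRange -Path {bigFile} -Skip 10000000 -First 20000 | Set-Content -Path {newFile}.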

I was able to track down the errant data – an opening quote with no closing quote – once I had the bad batch isolated.
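For this particular kind of problem, a rough way to narrow things down in the extracted chunk is to flag any line with an odd number of double quotes. Here’s a minimal sketch; Find-UnbalancedQuoteLine is a made-up name, and it assumes each record sits on a single line. If your format allows quoted fields to span lines, legitimate records will be flagged too.

```powershell
# Hypothetical helper: report lines that contain an odd number of double quotes.
function Find-UnbalancedQuoteLine {
    param(
        [Parameter(Mandatory)] [string] $Path
    )

    $lineNumber = 0
    Get-Content -LiteralPath $Path | ForEach-Object {
        $lineNumber++
        # Count quotes by comparing the length with and without them.
        $quoteCount = $_.Length - $_.Replace('"', '').Length
        if ($quoteCount % 2 -ne 0) {
            [pscustomobject]@{
                LineNumber = $lineNumber   # 1-based position within this file
                QuoteCount = $quoteCount
                Text       = $_
            }
        }
    }
}
```

Run against the smaller file, Find-UnbalancedQuoteLine -Path {newFile} lists the suspect lines. The line numbers are relative to the extract, so add the number of skipped lines to get back to a position in the original file.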

So, a small win, but a win nonetheless. Which is pretty much par for the course in my experience with PowerShell. Hope this helps someone else, too.

Running PowerShell files in SQL Operations Studio

As I’ve used SQL Operations Studio more and more, I’ve also finally been using PowerShell in more situations. Given that I like the editor and that there’s a built-in terminal, I’ve been running my scripts in my Ops Studio instance. But for a while I didn’t have a slick way of running an entire PowerShell file in the terminal. Usually, I’d just select all, copy, and paste into the terminal (Ctrl+A, Ctrl+C, Ctrl+V), which is a bit awkward.

But among all the other ways you can customize Ops Studio, you have a lot of control over the key mappings. One way to edit these mappings is to pull up the Command Palette (Ctrl+Shift+P) and start typing “key”, and you’ll see “Preferences: Open Keyboard Shortcuts”. You’ll also see that it mentions the Ctrl+K Ctrl+S shortcut, which is the other way in. This will bring you to the basic Keyboard Shortcuts window, where you’ll need to click “keybindings.json”. Either way, just like Ops Studio’s overall settings (and VSCode’s, for that matter), you get a JSON file you can now tweak. Actually, two of them, with the defaults on the left and your own settings on the right.

Here’s the mapping I’ve added:


{
    "key": "shift+f5",
    "command": "workbench.action.terminal.runActiveFile",
    "when": "editorTextFocus && editorLangId == powershell"
}

The effect of this is that pressing Shift+F5 while focus is on your PowerShell script file will cause Ops Studio to run that file in the terminal window. As an old SSMS and Visual Studio user, F5 seemed natural to me, and I noticed that Shift+F5 wasn’t already taken.

Note that this runs the file, not necessarily what you have in the window. So you may want to precede this with Ctrl+S while you’re working.

My gratitude to ck for pointing me in the right direction here. When it comes to either Ops Studio or PowerShell these days, I just assume he’s smarter than me.