Big Files, ETL Woes, and PowerShell

I have a love/hate relationship with PowerShell. It’s role in the grand scheme of my work is an extraordinarily valuable one. But the syntax often drives me nuts. It’s like bash, C#, and old COM-era Visual Basic got smushed together.

But, that rant aside, when I can figure out the syntax, I love the functionality. Here’s one example I’m sharing mostly for when I’ll inevitably need it later myself.

It came from working with enormous text files, and the gotchas that come with them. In this case, I had a 50gb data file that had something wrong with it, buried about 25% in. The file was being processed just fine, until it hit this unexpected data. And because the ETL app was written for performance first, there wasn’t a lot of data validation being done. So it’d just go boom when it hit that batch.

So what was wrong with the file? Well, in order to determine that, I had to see what was in that batch. But you can’t just open a 50gb file in Notepad. Yes, there are other ways to get around this, but here’s the one I chose:

Get-Content {bigFile} | Select-Object -Skip 10000000 -First 20000 | Out-File -FilePath {newFile}

It’s pretty evident what this does. But just to be clear, it skips the first 10 million rows in the file, then writes the next 20,000 rows out to a new file.

Here’s the thing… from my googling, I was halfway expecting this not to work, because it seemed like Get-Content would chew up memory. Was it really unloading rows once it read them? Or was it going to die after eating up all the available memory? As it turns out, it appears to have been doing the former. Slowly, I’ll admit, but performance here was not my biggest concern. I just wanted to get a manageable chunk out of the middle of the file without blowing up memory usage. And that’s what I got.

I was able to track down the errant data – an opening quote with no closing quote – once I had the bad batch isolated.

So, a small win, but a win nonetheless. Which is pretty much on par for my experience with PowerShell. Hope this helps someone else, too.