Notes on Working with Big-ish Data
What we learned handling data just big and messy enough to be a giant pain
Mike Stucka of the Palm Beach Post wrote up his experience of crunching biggish-data for a recent story on the heroin-abuse epidemic and posted it on the NICAR-L listserv. We got his permission to publish it here, too, to help boost the signal. —eds
My coworkers thought some people on the list might benefit from some lessons I learned working with “pretty big data,” and perhaps start a discussion.
I finished a project with a home-built table that was about 16GB, some 60 million rows by 110ish fields. It was…big. Sometimes it was painful. Mostly, though, it worked out, and it got us what I think is a damned good story.
Anyway, I think it was Ben Welsh who’d observed something like: We have some good tools to work with Big Data, but not great tools for data that’s not quite so big. I ran into that situation.
Some things I learned:
Standardize Everything. Everything.
We combined two disparate but related datasets on emergency room visits and inpatient visits. Field names were sometimes different. One set had room for about 20 diagnosis fields that the other didn’t have. Standardize everything you can, rather than just what you think you might need. If you want to cringe at a tale of maintaining parallel queries for mostly similar datasets, give a talk from NICAR16 a listen.
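The idea can be sketched in a few lines: map each dataset’s field names onto one shared vocabulary before loading, so every later query is written once. The field names below are made up for illustration; the real datasets had 110-ish fields each.

```python
# Sketch: normalize two hypothetical schemas to one shared set of field
# names before import. All field names here are invented for illustration.
ER_MAP = {"pat_age": "age", "diag_1": "prindiag", "fac_id": "facility"}
INPATIENT_MAP = {"age_yrs": "age", "principal_dx": "prindiag",
                 "facility_code": "facility"}

def standardize(row, field_map):
    """Rename a record's keys to the shared schema; keep unmapped keys as-is."""
    return {field_map.get(k, k): v for k, v in row.items()}

er_row = {"pat_age": "34", "diag_1": "T40.1", "fac_id": "100012"}
ip_row = {"age_yrs": "34", "principal_dx": "T40.1", "facility_code": "100012"}

# Both rows now share one vocabulary, so one query serves both tables.
assert standardize(er_row, ER_MAP) == standardize(ip_row, INPATIENT_MAP)
```

Doing this once at import time beats maintaining two near-identical versions of every query forever after.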
Sometimes the Best Indexing Is no Indexing at All
I built indexes based on the queries I thought I’d be running against the dataset. When I saw my indexes were 7GB, I got rid of ’em all and never missed ’em. To steal a thought from the movie WarGames, sometimes the only way to win is not to play at all.
Booleans are a Good Way to Turn Complex Into Simple
Want to know if the patient had a heroin overdose? Make it a Boolean, like “ODheroin.” Run an update query once to set the flag, and then it’s nearly instantaneous to tally it.
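The flag-once, tally-fast pattern looks like this, sketched here in SQLite with a couple of made-up diagnosis codes (the real project used MySQL and much longer code lists):

```python
import sqlite3

# Sketch of the flag-once, tally-fast pattern. The table, codes, and
# column names are illustrative, not the project's real schema.
con = sqlite3.connect(":memory:")
con.execute("create table visits (prindiag text, ODheroin boolean default 0)")
con.executemany("insert into visits (prindiag) values (?)",
                [("965.01",), ("E850.0",), ("401.9",)])

# One slow pass to set the flag...
con.execute("update visits set ODheroin=1 where prindiag in ('965.01', 'E850.0')")

# ...after which tallies are nearly instantaneous.
(n,) = con.execute("select sum(ODheroin) from visits").fetchone()
print(n)  # 2
```

You pay the cost of matching the code lists once, in the update, instead of in every reporting query that follows.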
Give Yourself Extra Variables. Many.
It took two hours to add a new column. It took two seconds to rename a column. So I built a stable of reserve columns into my SQL, like Boolean variables b1, b2, b3, … b9. When reporting needed another variable, I’d grab one from the reserve, then create it for real the next time we had to run an import, replenishing the stock of spare Booleans. This saved an incredible amount of time.
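A minimal sketch of the reserve-column trick, again in SQLite (which has supported RENAME COLUMN since version 3.25); the spare-column and flag names are invented:

```python
import sqlite3

# Sketch of the reserve-column trick: create spare Booleans up front,
# so a cheap rename stands in for an expensive column add.
# Requires SQLite >= 3.25 for RENAME COLUMN; all names are illustrative.
con = sqlite3.connect(":memory:")
spares = ", ".join(f"b{i} boolean default 0" for i in range(1, 10))
con.execute(f"create table visits (prindiag text, {spares})")

# Reporting suddenly needs a fentanyl flag? Grab a spare and rename it.
con.execute("alter table visits rename column b1 to ODfentanyl")

cols = [row[1] for row in con.execute("pragma table_info(visits)")]
print("ODfentanyl" in cols, "b1" in cols)  # True False
```

The rename is a metadata change, so it doesn’t rewrite 60 million rows the way adding a column does.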
Work Smarter, Not Harder
I used Google Sheets to help build boilerplate SQL. It was bad enough to look for a single diagnosis variable in more than 30 different fields; it was much worse to look for a pile of variables in a pile of fields. Formulas in Sheets/Excel solved it.
="update " & A15 & " set " & C15 & "=true where prindiag in (""" & D15 & """) or othdiag1 in (""" & D15 & """) or othdiag2 in (""" & D15 & """) …
What did this save? Remnants of my sanity. A single update query ran to more than 35,000 characters; yes, a simple “set this to true where” kind of query was 35KB.
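The same boilerplate the Sheets formula produced can also be generated in a few lines of Python, which is handy if you’d rather not keep the query in a spreadsheet. The table, flag, and field names below are made up; only the shape of the query matters:

```python
# Generate the repetitive "flag if any diagnosis field matches" query
# instead of hand-writing it. Names here are illustrative.
table, flag = "visits", "ODheroin"
codes = "'965.01', 'E850.0'"  # the in-list, possibly hundreds of codes
fields = ["prindiag"] + [f"othdiag{i}" for i in range(1, 31)]

where = " or ".join(f"{f} in ({codes})" for f in fields)
sql = f"update {table} set {flag}=true where {where}"
print(len(sql))  # one flag, 31 fields, and already a very long query
```

With a real code list of hundreds of entries, it’s easy to see how a single one of these updates balloons to 35KB.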
sum(case …) is a great tool, and it greatly simplifies life. I could run queries through my main 16GB table in about eight minutes most of the time, especially if I was doing some cleaning up in Excel. And of course you also get the benefits of Boolean logic, allowing you to structure some fairly complex conditions. We did publish our methodology with our story, and here’s the real gist of that story, where the $4.1 million a day came from:
sum(case when prenatal=true or
(MiamiInfections=true and MiamiOpioids=true) or
(HepC=true and MiamiOpioids=true) or …
then tchgs else 0 end) as "Drug-related charges"
The "MiamiInfections" Boolean alone is triggered by more than 3,000 different combinations of codes and fields, and it’s just a small part of this query. All the prep work, again, got me really intricate results in about eight minutes for a 16GB table.
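Here’s the sum(case …) pattern in miniature, run over pre-computed Booleans in SQLite. The rows and flag combinations are invented; the real query combined many more flags:

```python
import sqlite3

# Minimal sketch of sum(case ...) over pre-computed Boolean flags,
# with made-up rows; the real query combined many more conditions.
con = sqlite3.connect(":memory:")
con.execute("create table visits (prenatal boolean, HepC boolean, "
            "MiamiOpioids boolean, tchgs real)")
con.executemany("insert into visits values (?, ?, ?, ?)",
                [(1, 0, 0, 500.0), (0, 1, 1, 1200.0), (0, 1, 0, 900.0)])

# Sum charges only for rows matching the Boolean conditions; everything
# else contributes 0, so one pass over the table answers the question.
(total,) = con.execute("""
    select sum(case when prenatal=1 or (HepC=1 and MiamiOpioids=1)
               then tchgs else 0 end)
    from visits""").fetchone()
print(total)  # 1700.0 -- only the first two rows qualify
```

Because the flags were set once during import, a conditional sum like this is a single table scan rather than thousands of code-list lookups.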
Big-ish Data Is Big-ish, and You Should Plan Accordingly
I came into this job to a computer with a 250GB drive and could not delete enough stuff to work with this data. Then the replacement 1TB drive started dying and took part of the operating system with it. A solid-state drive would have dramatically sped things up and probably improved reliability, but I didn’t have one.
Big-ish Data Can Be Unpredictable in Terms of Scale
Quantity has a quality all its own. I had tools to convert DBFs (CSVKit, LibreOffice), but nothing could handle 1.1GB DBFs. (My pain is yours; we released a bulk DBF-to-CSV converter that works fairly efficiently.)
Infrastructure Matters
I gather a newer version of MySQL might have had better performance. Once I got going on this, I really didn’t want to upgrade. And I still haven’t.
Contextual Editing Is Really Nice
I wrote most of the SQL in a Google Doc for shared editing, but really missed the benefits of a real text editor.
Please scream with any comments, questions, suggestions, or insults.
Mike Stucka became The Palm Beach Post’s self-titled data dork in early 2016, after finding ways to bring data into regular beat work for The Telegraph of Macon, Ga., The Salem (Mass.) News, and other publications all the way back to his undergraduate newspaper.