The Case for Flat Files in Big Data Projects
Time opens up the enormous Open Payments dataset using flat files and Ajax
The highly anticipated release of the Open Payments data this week by the Centers for Medicare and Medicaid Services arrived with a helpful disclaimer in the README file:
[The] latest versions of Microsoft Excel cannot display data sets with more than 1,048,576 rows. Some of these CSV files may exceed that limit.
You don’t say. The juiciest file, which contained information on payments that medical companies made to physicians for things like meals, travel, and consulting fees, was 2.6 million lines in a single 1.44 GB file.
ProPublica has lovingly chronicled the dataset’s many flaws, which boil down to the fact that about 40 percent of it is de-identified. While others have generally withheld judgment for this reason, we at Time felt that there was still tremendous value in working with what we have: a lot more information than we had a week ago.
This is the sort of dataset for which an interactive treatment is essential. One hopes to engage readers with the finer analyses, but what people want to know first is whether their doctors are listed in the data. This presented a technical challenge, because our small news apps team at Time is a lot more comfortable with client-side web development than we are with administering servers and databases. So we (Pratheek Rebala and I) decided to make the whole thing searchable using only flat files and Ajax requests.
This is not necessarily a very sane way to handle a large dataset. It would have been more expedient to load the data into MySQL (which we did, for offline analysis), write a Node.js Express app to fetch queries and return the results, and ping the app from the page. (Insert whatever database and language you want in there.)
But there’s a significant cost in that sort of undertaking: One needs to get the server up and running, monitor the traffic demands, scale it up and down as necessary, add a caching layer to prevent repeated requests from hitting the database each time, and otherwise worry about whether it’s working. The server and database need to be available for the entire lifetime of the article, which is ideally forever.
Flat files, on the other hand, get uploaded to a CDN and live there happily ever after.
To paraphrase Hippocrates: Databases are short, files are long.
To make this work, we first had to build a directory of every doctor in the data so that the autocomplete could offer suggestions when people searched for a physician. To deal with the CSV files, we read them in line by line with a little Node.js script that we’ve since made available on GitHub and npm. Splitting the names into files keyed by the first three letters of the last name was sufficient to get reasonably small files—anywhere from 500 bytes to about 250 KB for very common prefixes like “SCH.” The vast majority were under 2 KB.
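The bucketing step can be sketched roughly like this—a minimal, in-memory version, where the sample records and field names are hypothetical (the real script streamed the full CSVs line by line):

```javascript
// Hypothetical sample records; the real script read these from the CSVs.
const records = [
  { id: "45038", last: "Schmidt", first: "Anna" },
  { id: "45039", last: "Schneider", first: "Ben" },
  { id: "45040", last: "Wilson", first: "Cara" }
];

// First three letters of the surname, uppercased, e.g. "SCH"
function bucketKey(lastName) {
  return lastName.slice(0, 3).toUpperCase();
}

// Group records into buckets: { SCH: [...], WIL: [...] }
const buckets = {};
for (const r of records) {
  const key = bucketKey(r.last);
  (buckets[key] = buckets[key] || []).push({ id: r.id, name: `${r.last}, ${r.first}` });
}

// In production, each bucket would be written out as "SCH.json", "WIL.json",
// and so on; here we just print the keys.
console.log(Object.keys(buckets)); // [ 'SCH', 'WIL' ]
```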
After a user enters three letters into the search box, the page fetches the relevant directory file (“WIL.json”, “REB.json”, etc.), which contains the name and unique identifier for each doctor with a surname that begins with those three letters. These names are fed into the autocomplete function.
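The client side of that lookup is simple enough to sketch. The CDN base URL and the record fields here are assumptions (the file names follow the examples in the text):

```javascript
const DIRECTORY_BASE = "https://cdn.example.com/directory/"; // hypothetical CDN path

// Map the user's input to the directory file for its first three letters.
function directoryUrl(query) {
  return DIRECTORY_BASE + query.slice(0, 3).toUpperCase() + ".json";
}

// Fetch the prefix file once the user has typed three letters, then
// narrow the candidate list client-side as they keep typing.
async function lookup(query) {
  if (query.length < 3) return [];
  const res = await fetch(directoryUrl(query));
  const doctors = await res.json(); // assumed shape: [{ id, name }, ...]
  const q = query.toUpperCase();
  return doctors.filter(d => d.name.toUpperCase().startsWith(q));
}
```

One nice property of this design: the Ajax request fires only once per three-letter prefix, and everything after that is a cheap in-memory filter.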
Meanwhile, we used another Node.js script to interface with the offline MySQL database and generate a file for each doctor detailing each payment he or she received. These are labeled with the unique identifier (e.g. “45038.json”).
When users click on a name in the autocomplete box, the page makes a second Ajax call to the file with that doctor’s unique ID. It takes the response and visualizes it in a table on the page.
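The detail view follows the same pattern as the directory lookup. A rough sketch, with the CDN path and payment fields again being assumptions:

```javascript
const PAYMENTS_BASE = "https://cdn.example.com/payments/"; // hypothetical CDN path

// Build an HTML table from one doctor's payment records
// (assumed shape: [{ payer, nature, amount }, ...]).
function renderTable(payments) {
  const rows = payments
    .map(p => `<tr><td>${p.payer}</td><td>${p.nature}</td><td>$${p.amount}</td></tr>`)
    .join("");
  return `<table><tr><th>Payer</th><th>Nature</th><th>Amount</th></tr>${rows}</table>`;
}

// On click, fetch "<id>.json" and inject the table into the page.
async function showDoctor(id, container) {
  const res = await fetch(`${PAYMENTS_BASE}${id}.json`);
  container.innerHTML = renderTable(await res.json());
}
```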
To host the files, we used Amazon S3 with a CloudFront layer for high availability. Amazon makes it reasonably simple to allow for cross-domain requests from your site’s domain, unlike some cloud file services. (Looking at you, Rackspace.)
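On S3, allowing those cross-domain requests comes down to attaching a CORS configuration to the bucket. A minimal rule might look like the following (the allowed origin is a placeholder for your site’s domain):

```xml
<CORSConfiguration>
  <CORSRule>
    <AllowedOrigin>http://example.com</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <MaxAgeSeconds>3000</MaxAgeSeconds>
    <AllowedHeader>*</AllowedHeader>
  </CORSRule>
</CORSConfiguration>
```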
The biggest technical challenge we faced was transferring over 350,000 flat files to the S3 bucket. The files themselves are typically only a few KB, but uploading them one at a time ran at about 30 files a second. We eventually found a lifesaving tool called s3-parallel-put, which uses Boto, a Python interface to Amazon Web Services, to push files to S3 in parallel streams. That cut the upload time to about 30 minutes.
Accomplished backend developers will be rightfully scornful of this approach. For one thing, it could never efficiently accommodate logical queries. (“Show me pediatricians in California who received more than $100.”) But in our experience, there are many, many more developers with access to a CDN than there are with access to a server and the know-how to write effective server-side code. For projects that don’t involve a lot of data manipulation and custom queries, flat files are much easier to work with. And let’s see which project is still up and running ten years from now.