What We’ve Learned About Sharing Our Data Analysis

Publishing reproducible data that’s genuinely useful

From BuzzFeed’s H–2 visa investigation

Last Friday morning, Jessica Garrison, Ken Bensinger, and I published a BuzzFeed News investigation highlighting the ease with which American employers have exploited and abused a particular type of foreign worker—those on seasonal H–2 visas. The article drew on seven months’ worth of reporting, scores of interviews, hundreds of documents—and two large datasets maintained by the Department of Labor.

That same morning, we published the corresponding data, methodologies, and analytic code on GitHub. This isn’t the first time we’ve open-sourced our data and analysis; far from it. But the H–2 project represents our most ambitious effort yet. In this post, I’ll describe our current thinking on “reproducible data analyses,” and how the H–2 project reflects those thoughts.

What Is “Reproducible Data Analysis”?

It’s helpful to set out a couple of slightly oversimplified definitions. Let’s call “open-sourcing” the act of publishing the raw code behind a software project. And let’s call “reproducible data analysis” the act of open-sourcing the code and data required to reproduce a set of calculations.

Journalism has seen a mini-boom of reproducible data analysis in the past year or two. (It’s far from a novel concept, of course.) FiveThirtyEight publishes data and re-runnable computer code for many of their stories. You can download the brains and brawn behind Leo, the New York Times’ statistical model for forecasting the outcome of the 2014 midterm Senate elections. And if you want to re-run Barron’s analysis of SEC Rule 605 reports, you can do that, too. The list goes on.

In an ideal world, reproducible data analysis finds readership among experts and laypeople alike. Laypeople can question, and learn from, the reporters’ general approaches: What data did they use, and what decisions did they make en route to their findings? More technically-inclined readers can download the raw materials and reconstruct the findings from scratch. They can test alternative approaches, and inspect the code for bugs.

Why Reproducible Data Analysis?

At BuzzFeed News, our main motivation is simple: transparency. If an article includes our own calculations (ones that go beyond a grade-schooler’s pen-and-paper arithmetic), then you should be able to see—and potentially criticize—how we did it.

And that holds us accountable. Indeed, the very prospect of public scrutiny forces us to be as lucid and straightforward as possible. It discourages us from cutting corners. It lights a fire under our proverbial posteriors, and improves our work.

This approach also spurs a shift in perspective, by conjuring the mindset of an outside observer. (Conjuring actual outsiders helps too, of course. I discussed our analysis, and key questions about the data, with the Department of Labor’s helpful experts.) Asking yourself, What might someone find problematic about this? ends up being a pretty good way to fact-check an analysis. Better, I’d argue, than simply asking, Does this look right? The distinction may seem minor, but in practice it is significant.

There are reasons, of course, not to publish a fully-reproducible analysis. The most obvious and defensible reason: Your data includes Social Security numbers, state secrets, or other sensitive information. Sometimes, you’ll be able to scrub these bits from your data. Other times, you won’t. (A detailed methodology is a good alternative.)

How To Publish Reproducible Data Analysis?

At BuzzFeed News, we’re still figuring out the best way to skin this cat. Other news organizations might arrive at entirely opposite conclusions. That said, here are some tips, based on our experience:

Describe the main data sources, and how you got them. Art appraisers and data-driven reporters agree: Provenance matters. Who collected the data? What universe of things does it quantify? How did you get it?

Explicitly link each data-dependent passage to the code that supports it. The main page of the H–2 repository includes a section titled—imaginatively—“Analyses”. In that section, we enumerate each number-crunched statement, and then link to the specific pages of code that buttress that statement. It’s the first time we’ve taken this particular approach, and I’m happy with it. It serves a dual purpose: To make extra-certain that we’ve backed up each analysis, and to make the analysis easier to follow.

Summarize your methodology before you introduce any code. In earlier projects, we’ve tended to place methodological notes between chunks of relevant code. That’s a typical approach for a programming project. But there should also be an easily-readable and easily-findable basic explanation for non-programmers, ideally at the top of each relevant page.

Use Makefiles. Programmer Mike Bostock, formerly of the New York Times, has written eloquently about Makefiles. In his words: “Makefiles are machine-readable documentation that make your workflow reproducible.” Indeed, it’s a Makefile that makes our H–2 analysis easily reproducible. Once you’ve copied the GitHub repository to your (Unix-based) computer, and installed the required (but free) Python libraries, try executing make all on your command line. Doing so will reproduce our full workflow.
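To make the idea concrete, here’s a minimal sketch of what such a Makefile might look like. The file names, script names, and URL below are hypothetical examples, not the actual contents of the H–2 repository:

```make
# Hypothetical three-step pipeline: download raw data, clean it, analyze it.
# Running `make all` rebuilds only the steps whose inputs have changed.

all: output/analysis.csv

# Step 1: fetch the raw data (example URL).
data/raw.csv:
	mkdir -p data
	curl -o $@ "https://example.com/raw.csv"

# Step 2: clean the raw data with a (hypothetical) Python script.
data/clean.csv: data/raw.csv scripts/clean.py
	python scripts/clean.py data/raw.csv > $@

# Step 3: run the analysis on the cleaned data.
output/analysis.csv: data/clean.csv scripts/analyze.py
	mkdir -p output
	python scripts/analyze.py data/clean.csv > $@

.PHONY: all
```

Because each target declares its dependencies, the Makefile doubles as documentation: a reader can trace exactly which script and which input produced each output file.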

Try the Jupyter Project. Formerly known as IPython notebooks, Jupyter notebooks provide an easy way to interleave code and formatted text (written as Markdown). Similar projects exist for other languages, but (a) Jupyter supports a growing list of languages, and (b) GitHub now renders Jupyter notebooks directly on the site. We used them throughout the H–2 project.

Have other tips for publishing reproducible data analysis? I’d love to hear them. Share your recommendations in the comments below, or email them to jeremy.singer-vine@buzzfeed.com (work) or jsvine@gmail.com (personal).

Note: I’ve tended toward the plural “we” for a reason. A dozen hands—counting two per person, if we’re going to be transparent about methodology—contributed to the published analysis. Jessica and Ken, of course, provided substantial input on how best to approach and interpret the data. John Templon ran a fine-toothed comb through the code, flagging questionable assumptions and overly idiomatic syntax. And Mark Schoofs and Kendall Taggart provided clarifying edits. A huge thanks to all involved.
