When the News Calls for Raw Data

Thoughts on recent dataset postings from BuzzFeed and the New York Times

From the NYT’s interactive on the distribution of military surplus equipment to police.

Last Friday, the New York Times published a map visualizing the distribution of military surplus gear by US county. The map was produced by graphics editors Tom Giratikanon, Alicia Parlapiano, and Jeremy White, based on the data underpinning Matt Apuzzo’s June NYT article, War Gear Flows to Police Departments. Apuzzo’s article had been heavily cited in the last two weeks as useful background for a discussion of militarized policing in Ferguson, MO and other US cities.

On Tuesday of this week, the team at the Upshot posted to GitHub the full dataset they used to make the map. In a blog post on the Upshot, Apuzzo wrote:

In May, The New York Times requested and received from the Pentagon its database of transfers since 2006. The data underpinned an article in June and helped inform coverage of the police response this month in Ferguson, Mo., after an officer shot Michael Brown, an unarmed teenager.

The Times is now posting the raw data to GitHub here. With this data, which is being posted as it was received, people can see what gear is being used in their communities. The equipment is as varied as guns, computers and socks.

We were intrigued to see a dataset first used back in June published in response to more recent events, and spoke to Giratikanon, the team member who posted the dataset to GitHub, about why the team released it:

As soon as we published the map on Friday, there was intense interest in the data: Many people emailed and tweeted to ask for it, and some specifically wanted the raw data to do their own analysis. I spoke with Matt Apuzzo, the reporter who requested and received the data in May, and he was gracious enough to let the raw data be published and to write a post about it for The Upshot.

Because the raw data was basically usable, and because I didn’t want to introduce errors, we published the Excel file exactly how we received it. It’s been great to see the stories that newsrooms have already written using it, and the attention that the release has received on Twitter and GitHub.

Giratikanon also noted the benefits—and risks—of opening the data to others for manipulation:

Something I didn’t expect: GitHub makes it easy for others to contribute to your data and code, which is powerful. A few people took the time to consolidate the spreadsheets and turn it into a CSV. One person began factchecking the data to see if the outliers made sense. But including their requests requires verification. For instance, one of the people who made a CSV version initially left off 20,000 rows by accident, which would have been easy to miss.
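The dropped-rows problem Giratikanon describes is the kind of error a quick row-count check can catch before a contributed file is accepted. What follows is a minimal sketch of that idea in Python, not the Times's actual verification process. The function names and the tiny sample sheets are hypothetical, invented for illustration.

```python
import csv
import io


def count_data_rows(csv_text):
    """Count non-empty data rows in CSV text, excluding the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header line
    return sum(1 for row in reader if any(cell.strip() for cell in row))


def missing_rows(source_sheets, consolidated):
    """Return how many rows the consolidated file is short of its sources."""
    expected = sum(count_data_rows(sheet) for sheet in source_sheets)
    return expected - count_data_rows(consolidated)


# Hypothetical stand-ins for sheets exported from the original Excel file.
sheet_a = "state,item,quantity\nMO,rifle,3\nMO,socks,40\n"
sheet_b = "state,item,quantity\nNY,computer,5\n"

# Two hypothetical contributed CSVs: one complete, one that lost a row.
merged_ok = "state,item,quantity\nMO,rifle,3\nMO,socks,40\nNY,computer,5\n"
merged_short = "state,item,quantity\nMO,rifle,3\nNY,computer,5\n"

print(missing_rows([sheet_a, sheet_b], merged_ok))     # 0
print(missing_rows([sheet_a, sheet_b], merged_short))  # 1
```

A matching row count does not prove the contents are identical, so a check like this is a first filter rather than a full verification.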

Yesterday, BuzzFeed’s Jeremy Singer-Vine published a short article on the extreme racial segregation in St. Louis County, where Ferguson is located. He also posted the data and code he used for his analysis, along with notes on the process, on GitHub. We caught up with Singer-Vine last night to ask about his experience with the data:

The data itself is pretty standard demographic data from the latest American Community Survey (the 2012 five-year estimates). I got it through the Census Bureau’s American Fact Finder. The tool can be a bit headache-inducing but contains an amazing wealth of data. The main part of the analysis uses a standard measure of segregation called the “index of dissimilarity.” No single measure of segregation is perfect, of course, but the index of dissimilarity is simultaneously widely accepted and reasonably easy for lay readers to understand.

An interesting bit from this Census report on segregation statistics: “Despite its imperfections, since Duncan and Duncan (1955) the index of dissimilarity has been and remains the most widely-used measure of the evenness dimension and no other index has achieved such widespread acceptance as a summary statistic of segregation.”
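For readers unfamiliar with the measure, the index of dissimilarity for two groups is usually written as half the sum, over all areas, of the absolute difference between each area's share of one group and its share of the other. It runs from 0 (both groups spread identically) to 1 (complete separation). The sketch below is our own illustration of that standard formula, not Singer-Vine's published code. The function name and the two-tract population counts are made up and are not American Community Survey figures.

```python
def dissimilarity_index(group_a, group_b):
    """Index of dissimilarity for two groups counted across the same areas.

    group_a and group_b are sequences of population counts, one per area
    (for example, one per census tract), listed in the same order.
    """
    if len(group_a) != len(group_b):
        raise ValueError("both groups need one count per area")
    total_a = sum(group_a)
    total_b = sum(group_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each group needs a nonzero total population")
    # Half the summed gap between each area's share of the two groups.
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(group_a, group_b)
    )


# Hypothetical two-tract examples, not real census counts.
print(dissimilarity_index([100, 0], [0, 100]))           # 1.0, fully separated
print(dissimilarity_index([50, 50], [50, 50]))           # 0.0, evenly spread
print(round(dissimilarity_index([80, 20], [20, 80]), 2)) # 0.6
```

A value of 0.6 is commonly read as meaning that 60 percent of one group would have to move to a different area for the two groups to be evenly distributed.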

He also walked through his team’s rationale for publicly posting this and other datasets whenever possible:

Why post it on GitHub? First, a platitude that—while still a platitude—I endorse: As journalists marshall more data than ever, collect it from a wider range of sources, and analyze it in increasingly complex ways, it’s important (and interesting!) to be transparent about those processes. I think about it in three ways:

  1. Verifiability—A reader should be able to check our sources and code/math for any obvious mistakes.

  2. Reproducibility—Even better, we should make it possible for readers to conduct the same analysis and get the same results. This helps ensure that, on top of the analysis looking right, it also works right.

  3. Reusability—Even better-better if the data and code we publish is reusable. Reusability means different things for different projects, but could include: Being able to run the same analysis on updated / different data (e.g., for a different city or country); being able to tweak the parameters of a particular analysis; being able to run entirely new analyses on the data.

I also think that open-sourcing our work engenders good habits: Clean, readable code; sensical project structure; justifiable analyses; scriptable workflows; a general “separation of concerns.”

We’re trying to publish as much of our data and code as possible. To make finding these things a little bit easier, we created a new meta-repo today that lists all our open-sourced data, analysis, et cetera. (Note that this is specific to the newsroom data team; I can’t speak for the brilliant-but-separate site/CMS/other devs, who were publishing open-source projects long before I joined BuzzFeed.)

Of course, there are times when we won’t/shouldn’t publish the raw data. The most obvious scenario: When it includes sensitive personal information, such as medical records, social security numbers, et cetera.

Whether they’re published immediately as part of the articles they support, or posted later in response to public demand, these stores of raw data add a dimension to major news conversations that one-time analysis alone can’t offer. We’ll be keeping an eye out for more datasets that emerge into public view as the circumstances call for it—and for interesting uses and reuses of the data when they do go public.




