Open Your Data

Waldo Jaquith on the whys and wherefores of making it open


In the days of print-only journalism, source documents were rarely published because of the expense of printing them. The New York Times could publish only excerpts of the Pentagon Papers, because the study ran some 7,000 pages; other publications merely wrote about the revelations contained within the Department of Defense study, rather than reproducing any portion of it. Today, in the era of digital journalism, there is no longer any such thing as a space limitation. The modern Times wrote extensively about the revelations provided by Wikileaks, but also provided a web-based archive of particularly interesting cables, in their entirety.

It has become standard to use DocumentCloud to publish such source materials. Most major media outlets have accounts, and when reporting relies on FOIA-ed documents, they upload those documents to DocumentCloud, and link to them from the story. This allows readers to verify the claims within the article, and to dive more deeply into those source materials than the story did. Perhaps more important, the mere presence of those links lends authority to the story, even if few readers are likely to actually read the linked-to source materials. To maintain credibility among readers, it’s important to move beyond “just trust us.”

It is no less important to provide source materials for quantitative data—spreadsheets, databases, etc. Not only does this permit the same process of verification and exploration, but it allows more technically astute readers to duplicate the analysis performed by the reporter, and perhaps build on that analysis. However, providing those materials can be a lot messier than simply uploading files to DocumentCloud. At its worst, this can be an onerous process, creating a great deal of technical debt and requiring indefinite updates. But at its best, it can be almost as easy as sharing documents on DocumentCloud, amounting to a set-it-and-forget-it proposition.

Let’s look at three situations in which data is—or isn’t—shared.

The Saddest Spot in Manhattan: a cautionary tale

In September, the New York Times’ “City Room” blog reported that a sentiment analysis of tweets by the New England Complex Systems Institute had revealed that Hunter College High School is the saddest place in Manhattan. The story had previously been reported in other publications (Science magazine and New York Press), but the Times coverage did a lot to increase popular awareness of the study. The top-rated public school’s dubious distinction came as a surprise to some students, many of whom just didn’t buy it. Commenters on the Times story raised questions about the data, such as how the researchers knew that the tweets were coming from the school, and not from people at the nearby subway stop, people walking by on the street, or students at any of the three schools within a couple of blocks. But because the original data had not been provided, it was impossible for anybody to verify the claims made within the article.

Three weeks after the story was published, the Times retracted it. The author of the study, Professor Yaneer Bar-Yam, confessed that not only were the tweets not coming from Hunter College High School, but nearly all of them appeared to have come from a single person living in the area.

Mistakes happen. By not providing the original data (nor, apparently, requiring it from Prof. Bar-Yam), this mistake lingered for weeks before being corrected. Had a CSV file of the geotagged tweets and their calculated sentiments been included, one imagines that a reader (or Times journalist) comfortable working with data would have discovered the errors in analysis within hours, not weeks.
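That check would not take much. Here is a minimal sketch in Python, assuming a hypothetical CSV with one row per geotagged tweet and columns named user_id and sentiment (neither the file nor the column names were ever published):

    import pandas as pd

    # Hypothetical export of the geotagged tweets and their calculated sentiments.
    tweets = pd.read_csv("hunter_college_tweets.csv")

    # How many distinct accounts produced these tweets? One dominant account is
    # a red flag that the "saddest place" may really be one unhappy person.
    counts = tweets["user_id"].value_counts()
    print(counts.head(10))
    print(f"Top account wrote {counts.iloc[0] / len(tweets):.0%} of the tweets")

    # Compare average sentiment with and without the most prolific account.
    top_user = counts.index[0]
    print("All tweets:", tweets["sentiment"].mean())
    print("Without top account:", tweets.loc[tweets["user_id"] != top_user, "sentiment"].mean())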

Stimulus Spending: sharing the wealth of data

During President Obama’s first term, $800 billion in stimulus money was allocated by the federal government. That spending was accounted for on Recovery.gov, but the raw data provided by the federal government was divided across hundreds of files. Like many media outlets, ProPublica was keeping an eye on that spending, to inform readers where their tax money was going. ProPublica published a series of articles about recovery spending, “Eye on the Stimulus.”

In order to conduct the research necessary to write about recovery spending, ProPublica needed to combine those hundreds of files from Recovery.gov into something more manageable. They created a couple of Excel files for their own use and, realizing that others could use them, they simply posted those Excel files to their website. Readers could download those files, open them in Excel, and use that program’s popular point-and-click interface to conduct their own analysis.
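As a rough illustration of that consolidation step, the sketch below assumes the hundreds of Recovery.gov files are CSVs with compatible columns sitting in one directory; the file formats and layout here are assumptions, not a description of ProPublica’s actual process:

    import glob

    import pandas as pd

    # Read every downloaded file and stack them into one table.
    frames = [pd.read_csv(path) for path in glob.glob("recovery_gov/*.csv")]
    combined = pd.concat(frames, ignore_index=True)

    # Write a single spreadsheet that readers can open directly in Excel.
    combined.to_excel("stimulus_spending.xlsx", index=False)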

This is a testament to the notion that raw data need not be fancy. While Excel files aren’t ideal (not everybody owns a copy of Excel), they’re a great deal better than providing no data at all.

Playgrounds for Everyone: getting data from everyone

In August, NPR’s All Things Considered aired an in-depth story about how playgrounds are changing to accommodate children with disabilities. To research the story, NPR staff needed to find out just how widespread accessible playgrounds are, data that simply didn’t exist. After gathering data from parks & rec departments across the country, they made their collected data available for download as CSV and JSON, and then asked visitors to add any playgrounds that were missing.

The NPR team didn’t just share their data, but also allowed others to submit enhancements to it, and then built a website to let people browse that data. The resulting Playgrounds for Everyone website currently lists nearly 2,000 accessible playgrounds around the country. Over 500 of those were submitted by readers. (The impressive technological infrastructure behind Playgrounds for Everyone was described in detail by developer Jeremy Bowers in a September Source article, “Complex But Not Dynamic.”) This is a step beyond sharing source data, one that allowed NPR’s data to become richer, and has established them as the authoritative source of data about accessible playgrounds. The radio story only aired once, but the underlying story continues to develop on their website.

How to Share Source Data

When evaluating methods of sharing source data, there are two primary criteria to balance: how usable the data will be for your audience, and the burden on your organization of providing that data indefinitely.

There are three primary ways to share your source data:

  • As bulk data files
  • By creating an application programming interface (API) for it
  • By outsourcing that data to a third-party service

Bulk Data

“Bulk data” means, in most cases, simply sharing the same raw data that you used. That might be an Excel file, CSV, XML, GeoJSON, or any number of other formats. This is the easiest path, if not a particularly technically sophisticated one. That lack of technical sophistication has an advantage, though: far more people know what to do with a data file than know how to use an API. Providing files means that no future maintenance is required—there’s no need to ever think about them again. When it works to provide source data as a bulk download, this is generally the best approach.

Application Programming Interface

At the opposite end of the spectrum is an API. This means writing or installing software to interface between requests and a database, serving up just the data that’s requested. Where a bulk download is, well, bulky, an API is nimble. For complex types of data, such as those with rich relationships or multiple interrelated tables of data, only an API will do. An API is enormously useful to a tiny sliver of people, but even many programmers wouldn’t know what to do with one. The real catch is that APIs incur a lot of technical debt, in the form of maintenance for the indefinite future. That data is stored in a database—what if the database becomes corrupt? Are you going to continue to update the database software indefinitely, as security holes and bugs are identified? What about when an eventual new version of your API’s software (Ruby, Python, PHP, etc.) no longer supports that database—are you going to drag the data into a new database to let that one-time feature live on, or let it fall by the wayside as a victim of bitrot?
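To make that concrete, here is a minimal sketch of such an API in Python, assuming Flask and a SQLite database named data.db with a records table; none of this is a prescribed stack, just one common way to put software between requests and a database:

    import sqlite3

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/api/records/<int:record_id>")
    def record(record_id):
        # Look up just the row that was asked for and return it as JSON.
        conn = sqlite3.connect("data.db")
        conn.row_factory = sqlite3.Row
        row = conn.execute(
            "SELECT * FROM records WHERE id = ?", (record_id,)
        ).fetchone()
        conn.close()
        if row is None:
            return jsonify({"error": "not found"}), 404
        return jsonify(dict(row))

    if __name__ == "__main__":
        app.run()

Every piece of this (the framework, the database, the server it runs on) is something your organization has to patch and host for as long as the API stays up, which is exactly the technical debt described above.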

API/Bulk Download Hybrid

Parenthetically, there’s an approach that provides the best of both the bulk download world and the API world.

For many types of data, one can simulate an API via clever bulk downloads. A directory full of properly named JSON files functions much the same as an API, providing the ease of bulk data with the nimbleness of an API.

Imagine that you’re writing an article about restaurant health inspections, for which you have 1,000 inspection records, one per restaurant. You could create one file, index.json, that is a list of every restaurant, providing each one’s name, address, aggregate health score, health department ID (e.g., 00000001, 00000002, etc.), and API URL (e.g., http://www.example.com/specials/health_inspections/api/00000001.json, http://www.example.com/specials/health_inspections/api/00000002.json, etc.). Then you could create one file for each restaurant, 1,000 in all, containing the actual inspection data, named 00000001.json, 00000002.json, etc. For good measure, you could zip all of them up into a single download, health_inspections.zip.
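Generating that faux API can be a short, one-time script. This sketch assumes the inspection records are already in memory as a list of dictionaries; the field names are illustrative:

    import json
    import os
    import zipfile

    BASE_URL = "http://www.example.com/specials/health_inspections/api/"

    restaurants = [
        {"id": 1, "name": "Example Diner", "address": "123 Main St.", "score": 92,
         "inspections": [{"date": "2013-05-01", "violations": 2}]},
        # ...and 999 more records
    ]

    os.makedirs("api", exist_ok=True)
    index = []
    for r in restaurants:
        file_id = "%08d" % r["id"]  # e.g., 00000001
        index.append({"id": file_id, "name": r["name"], "address": r["address"],
                      "score": r["score"], "url": BASE_URL + file_id + ".json"})
        with open(os.path.join("api", file_id + ".json"), "w") as f:
            json.dump(r, f, indent=2)

    # The index that lists every restaurant and points at its per-record file.
    with open(os.path.join("api", "index.json"), "w") as f:
        json.dump(index, f, indent=2)

    # For good measure, bundle everything into a single bulk download.
    with zipfile.ZipFile("health_inspections.zip", "w") as z:
        for name in os.listdir("api"):
            z.write(os.path.join("api", name))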

This faux API provides the best of bulk downloads and the best of APIs. It incurs no technical debt, requires little technological sophistication to use, offers a bulk download for those who would prefer to work with the data locally, and provides the granularity of an API.

Hosted Services

There isn’t yet a DocumentCloud-for-data that’s tailored to media outlets (hint, hint), but there are a couple of hosted services that are easy to get set up with and that make it easier for end users to interact with your data.

There’s Socrata, a company that’s primarily in the business of hosting government data sets. They make it easy to upload data sets, they automatically convert them into a half-dozen popular formats and create an API for them, and they provide embed codes so that you can allow people to browse the data on your own website.
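Once a data set is on Socrata, readers can pull it over plain HTTP through Socrata’s SODA API. In this sketch, the domain and dataset identifier are placeholders, not a real data set:

    import requests

    # Hypothetical Socrata endpoint; hosted data sets get a URL of roughly this shape.
    url = "https://data.example.gov/resource/abcd-1234.json"
    rows = requests.get(url, params={"$limit": 100}).json()
    print(len(rows), "rows retrieved")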

And there’s GitHub, which was created to host collaboratively developed software, but that also does a fine job of hosting data sets (ranging from CSV to geodata) in a fashion that makes them easy to browse and interact with.

Do It Now

Programmers aren’t wizards, and comprehension of their work isn’t beyond the reach of mere mortals. “Because the computer said so” is not reason enough to claim to have identified the saddest place in Manhattan. Sharing source data allows readers to see the facts for themselves. Although very few readers are likely to look at any provided data, the immediate availability of that data means that those who do study the data will serve as de facto peer reviewers, and some will even use that data to uncover new aspects of the story.

Soon enough there will be a story as big as the Pentagon Papers or Wikileaks, but the source material won’t be prose—it’ll be data, and the lede will be derived from analysis of those data. The publication that breaks that story will need to make that data available for review, if they’re to be taken seriously. The time to institute that practice isn’t when it happens, but now. In a small news organization, it’ll be your call. In a large news organization, you may have to fight the good fight to get there.
