
Freeing the Plum Book

Derek Willis mines government mobile apps to liberate data and issues a call to arms for collaboration


Inside the Plum Book

Every four years, a congressional committee publishes a snapshot listing of nearly 4,000 presidential appointees throughout the federal government. Journalists, lobbyists and students of government know this as The Plum Book—due to the color of its cover—and they use it to identify who works where and roughly how much an appointee earns. The appeal of this publication to journalists is obvious: it makes the sprawling federal bureaucracy a bit easier to navigate, showing which officials lead an agency and whom to contact for stories. It’s also a way to see which campaign donors and other presidential supporters have been placed in key administrative posts.

The Plum Book is a key window into the workings of the federal government, yet it is published only sporadically and in difficult-to-parse formats. The Government Printing Office, which publishes the Plum Book and scores of other federal references, has been slowly making its products more digital-friendly, but that process has been a long time coming. If we had access to the Plum Book listings as data, though, a number of stories would become possible without a large investment of time.

Annie Lowrey, a reporter in The Times’ Washington bureau, had told me about one such story. She was interested in determining the gender breakdown among President Obama’s appointees, and was hoping that there was a way to use the Plum Book to help with that process. As she started reporting on the story, I went to work on the data. I was prepared for a long slog into PDF parsing and regular expressions. What I found instead was that, in this case at least, innovation by the federal government made my job easier, and made it possible for The Times to publish a story that we hadn’t done before.

The Problem: It’s a Book

The initial problem with The Plum Book is just that: it’s a book. For all practical purposes, even though it contains page after page of lists of names, positions and salary information that anyone who has used a spreadsheet would recognize as data, searching the contents—for most users—means a lot of Ctrl-F and frustration.

A number of journalists have gone through the process of converting the existing (and previous) Plum Book PDFs into parseable text, in various ways—by using programs that can convert PDF content into fixed-width text or by copying and pasting or scraping the HTML version.

None of these methods are clean and easy, for several reasons. First, the PDF content wraps across lines, requiring lots of data cleanup. Next, the “columns” are separated by a variable number of periods. Finally, the data are divided into sections that represent offices and agencies, which then must be folded back into the cleaned rows for any analysis. Just typing the whole thing in starts to look like a decent approach after a while.

In late 2012, however, the GPO did something very different. For the first time, it published a mobile version of the Plum Book, which allowed users to browse presidential appointees by branch, agency, position and other criteria. Mobile apps from the government are still something of a novelty, but this one caused me to really stop and think. Maybe this was a route to the data.

JSON Everywhere

It was. The Plum Book mobile site is built with Backbone.js. A little exploration (with help from my colleague Jeremy Ashkenas, Backbone’s creator) revealed JSON data on the backend for each position. Here’s an example:

{ "location":"Washington, DC", "id":1, "title":"Secretary", "expires":"", "branch":"Executive", "tenure":"", "agcy_name":"Department of Agriculture", "org_name":"Office of The Secretary", "pborg_seq":"6752", "pborg_managed_by":"6751", "org_order":"10", "name_of_incumbent":"Thomas James Vilsack", "type_of_appt":"Presidential Appointment with Senate Confirmation(PAS)", "pay_plan":"Executive Schedule(EX)", "pay":"I", "pb_order":"5" }

So instead of having to parse semi-structured text, all I needed to do was to retrieve the JSON representation of each position and store the attributes. Using Ruby, I wrote a script that cycled through the positions and saved the JSON locally. This proved fairly simple, but when I ran it I noticed that after a decent number of positions the GPO site would stop serving data. I believe this was an automatic trigger, since I was able to switch to a different IP address and continue downloading the JSON.
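Here’s a minimal sketch of that retrieval step in Ruby, using only the standard library. The endpoint URL and the id range are placeholders rather than the mobile site’s actual paths, and the pause between requests is just a guess at polite pacing:

require 'net/http'
require 'fileutils'

# Placeholder endpoint: the Plum Book mobile site serves a JSON record
# like the example above for each position id.
BASE_URL = "http://example.gpo.gov/plumbook/positions"

FileUtils.mkdir_p("positions")

(1..8000).each do |id|
  response = Net::HTTP.get_response(URI("#{BASE_URL}/#{id}.json"))
  next unless response.is_a?(Net::HTTPSuccess)

  # Save each record locally so parsing can happen offline.
  File.write("positions/#{id}.json", response.body)
  sleep 1 # pace requests; the GPO site stopped responding to rapid pulls
end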

Missing Ingredients

Turning the JSON data into a CSV file that Annie would be able to use was the next step, and the second problem. The methods for doing this are standard in just about any language and not worth dwelling on. But that CSV file was still missing one piece: the gender of the person listed. We needed that in order to address the basic question Annie was asking: how many Obama appointees were women?
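For completeness, here is roughly what that conversion looks like in Ruby, assuming one saved JSON file per position (as in the sketch above) and picking a handful of fields from the example record:

require 'json'
require 'csv'

# Columns to keep, named after the attributes in the JSON example.
FIELDS = %w[agcy_name org_name title name_of_incumbent type_of_appt pay_plan pay location]

CSV.open("plum_book.csv", "w") do |csv|
  csv << FIELDS
  Dir.glob("positions/*.json").sort.each do |path|
    position = JSON.parse(File.read(path))
    csv << FIELDS.map { |field| position[field] }
  end
end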

Time and again, people who work with data encounter similar problems. There is no reason to invent new solutions when existing ones can be used or improved upon. Journalism benefits when open-source developers share their code, which is why it’s important that we share our work, too. To determine gender, we relied on shared code in the form of an unfortunately named Ruby library called Sex Machine, which returns a gender for each first name it is given (there are similar libraries in most other programming languages). This process worked for the vast majority of the names in the Plum Book, and Lowrey and Times researcher Kitty Bennett added the gender to the several hundred records for which the library did not provide one. It’s better not to trust the judgment of an automated process, though, so we set the threshold for accepting the library’s decision fairly high and left the remainder to be checked by hand. In this case, that meant reporting out the gender of most people named “Pat.”
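A sketch of how that step might look, assuming the sexmachine gem’s Detector interface and the CSV produced above; the “fairly high” threshold here simply means accepting only the library’s unambiguous answers and leaving everything else blank for hand-checking:

require 'csv'
require 'sexmachine'

detector = SexMachine::Detector.new
rows = CSV.read("plum_book.csv", headers: true)

CSV.open("plum_book_with_gender.csv", "w") do |csv|
  csv << rows.headers + ["gender"]
  rows.each do |row|
    first_name = row["name_of_incumbent"].to_s.split.first
    guess = first_name ? detector.get_gender(first_name) : nil
    # Accept only unambiguous answers; :mostly_male, :mostly_female and
    # :andy (androgynous names like "Pat") are left blank for a reporter
    # to check by hand.
    gender = [:male, :female].include?(guess) ? guess.to_s : ""
    csv << row.fields + [gender]
  end
end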

This combination of JSON and Ruby delivered clean, consistent data in a few hours where previously a messier process had taken much longer and yielded less certain results. The gender identification process played a significant role in Annie’s story and a graphic by Alicia Parlapiano that illustrates the gender ratio in 15 cabinet departments.

Gender ratio graph (Alicia Parlapiano/New York Times)

We also released the data, both as an Excel file and on GitHub. This could make it easier to, for example, connect the people listed in the Plum Book with congressional, lobbying and campaign finance data, potentially yielding further stories.

Share the Wealth

As helpful as the now-accessible Plum Book data was to our story, our work also allowed us to publish a data set that’s now easier for others to use. There are many, many other government data sets waiting to be made more accessible. The good news is that the GPO has a few more mobile apps, including one for Presidential Documents. Governments at all levels are moving towards mobile apps that rely on data across the wire—and that represents opportunities for developers and journalists to get more information faster. We need people to identify and retrieve these data, and to highlight their utility for journalism.

Sometimes reporters know what data is available and what questions they’d like to ask of it, as in the case of Annie’s story. But too many times I’ve seen stories hit the website or the paper and thought, “I wonder if that reporter knew about [insert data set here].” As much as I’d like to think that reporters will find the right data for stories, the truth is that too much knowledge about government data remains siloed, held by individual reporters, departments and other users.

Worse, there is little consensus among advocates of greater government data transparency, journalists and academics on which data that is currently difficult or impossible to get would be most valuable to make accessible, and even less coordination on documenting the chosen data. Data catalogs, such as Data.gov, have their place, but there needs to be an editorial layer built on such efforts. That’s not the government’s job.

The Times released the Plum Book data and wrote about the experience as a step towards providing that kind of documentation, and if that can spur others to do the same, great. But the greater need, I think, is for intelligent collaboration on identifying, gathering and documenting government data. As an example, I’d point to an effort started by Eric Mill of the Sunlight Foundation, Joshua Tauberer of GovTrack and me that lives on GitHub (although The Times supports and encourages open-source work, my contributions to this particular project come on my own time). It began with our joint work on scraping congressional data but has since expanded into other federal government areas.

Because all three of us collect and work with congressional data for our organizations, we found ourselves repeating the same tasks, especially when it came to scraping the Library of Congress’s THOMAS legislative site. Instead of reinventing the wheel, we chose to work together on our scrapers and took it a step further by releasing the data. It wasn’t difficult for me to pitch the idea to my editors, since it was a way to benefit from the fruits of our combined effort: more eyeballs on the code and data means that we’re better able to check our work.

There are many other areas of government information to work on. We’re just getting started. A small list of legislative datasets that could be compiled in more useful formats than plain text is one starting point. In particular, congressional committee information is scattered across different places and inconsistent formats. What we don’t know about committee actions—including membership over time, votes and other official actions—dwarfs what we do know. Working with this or other government data requires a willingness to dive in and the ability to ask questions, particularly of people who might use the data. Nearly every government dataset comes from a community; find someone who has worked with a dataset and you’re likely to find not only a good guide to it but also a series of ideas for improvements and a sense of what’s important.

There’s no reason why such foundational government data—data that helps news organizations create new stories and user experiences—should be maintained separately by multiple news outlets. Plenty of us work with congressional data, or campaign finance data, or environmental data. Why should each of us have to repeat the same steps over and over? A better way is to release public data that we do have, or work together to develop a common set of documentation and data for information that many of us rely on for reporting.

Welcome to the Club

The advantages of collaborating on common government data are many, but the most obvious is that the more people who work with a dataset, the more we’ll all know about that information, provided they share their experiences. That’s why we also need better shared documentation of common data. Often, government data—whether it comes from Congress or an executive branch agency—is treated like some sort of ancient religion or fraternity, where outsiders don’t really understand what’s going on until they’ve spent hours practicing and asking questions. That is terribly inefficient in an era of shrinking resources, and it increases the chance that, working in isolation, one of us will get something wrong.

The next time (which may even be the first time) that you work with a piece of government data, ask yourself some of these questions:

  1. Who else has worked on this data? Have they offered any guidance? One place to start is to look at Investigative Reporters and Editors’ collections of stories and tipsheets.
  2. Does it make sense to release some or all of this data to the public? A rule of thumb here is that if you could ever see yourself using this data more than once, it’s a good candidate for release.
  3. Are the quirks of this data publicly documented? Some government agencies do this already, but if you had trouble importing the data or found it to be inconsistent, then you should make other users aware.
  4. Could this data be connected to other data, increasing its value? If the data has a geographic element, can you improve it by adding, say, GNIS codes? If it deals with members of Congress, can you add unique identifiers from the Biographical Directory of the United States Congress? (A rough sketch of that last idea follows this list.)
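
To make that last question concrete, here is a hedged Ruby sketch of adding Bioguide identifiers to a dataset of members of Congress. Both file names and their columns are hypothetical; the lookup could be built from the Biographical Directory or from an existing open data project:

require 'csv'

# Hypothetical lookup file: one row per member, mapping a name to a
# Bioguide ID drawn from the Biographical Directory.
bioguide = {}
CSV.foreach("bioguide_lookup.csv", headers: true) do |row|
  bioguide[row["name"]] = row["bioguide_id"]
end

# Hypothetical input: a dataset keyed only by name. Adding the identifier
# makes it joinable with other congressional data keyed on Bioguide ID.
members = CSV.read("members.csv", headers: true)

CSV.open("members_with_ids.csv", "w") do |csv|
  csv << members.headers + ["bioguide_id"]
  members.each do |row|
    csv << row.fields + [bioguide[row["name"]]]
  end
end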

We’ve got work to do. Roll up your sleeves.

About the Author

Derek Willis is an interactive news developer at The New York Times, where he works on political and legislative APIs and apps. He likes cricket, congressional procedure and making lists. He was a winner of Slate Magazine’s final “Six Degrees of Francis Bacon” contest.
