Competition Be Damned

How reporters at the Washington Post, New York Times, ProPublica, and more self-organized to free trapped FEC data

Last Wednesday, the Trump Inaugural Committee’s FEC filing appeared on the FEC site in its image-based glory. ProPublica’s Derek Willis noted its arrival on Twitter.

Later that day, the Washington Post’s Steven Rich made the first commit to a public dataset on GitHub extracted from the filing. Since then, several more news nerds, including Willis, have jumped in to contribute to and correct the data. Huffington Post White House reporter Christina Wilkie used the resulting dataset as the jumping-off point for a crowdsourced investigation of donors that became the basis for her own and other outlets’ investigations.

We spoke with four of the project’s contributors—Rich, Willis, the New York Times’ Rachel Shorey, and the Connecticut Mirror’s Andrew Tran (soon to join Rich at the Post), about how the data got freed.

Source: How did this collaboration happen? I assume this filing was what you started with—who decided to turn it into useable data, and how did other people help out?

Steven Rich: So on Wednesday morning, I got an email with that link to the FEC website from a reporter on the political team here, Roz Helderman, asking if I could scrape the data from the PDF into a spreadsheet so we could analyze it for a story. I said sure and created a dataset originally just consisting of transaction ID, name, and amount donated (or refunded). That was great for our purposes, and so I decided to put it on GitHub and continue to add fields myself and seek help from those who wanted the data as well.

Derek Willis: I saw the filing when it came out and had that usual familiar pang of annoyance at an image PDF, but it wasn’t something ProPublica needed to jump on right away, so I didn’t give it much thought other than looking though it a bit.

Then I saw Steven’s tweet about it and thought yeah, that’s the right thing to do. One thing about working on OpenElections is that I’ve become much better at data entry—not that there’s a scientific technique, but just from doing it. So I thought…I can contribute to this effort.

Andrew Tran: I saw a tweet about the FEC document from Derek Willis and thought it’d be fun to try to scrape the data from the PDFs into a spreadsheet. It took a while to script and generate and by the time it was done, I noticed some columns had a higher success rate and others didn’t—mostly in the donation amounts.

But then I saw my future co-worker Steven tweet out a repo of his clean data and saw that he was just missing the location information like address, state, and zip from the donors in the PDF, which my script had done a much better job of scraping.

It made sense to then focus on the data that he didn’t have yet. I let him know on Twitter that I was very close to putting together the last part of the data on Twitter in case anyone else was about to go down the same rabbit hole I was in.

Source: How did you do the actual extraction work? Did you hit any weird issues along the way?

Steven Rich: The extraction was a bit of a few things for me. It involved some Tabula to extract things like names and states, but some of the fields had to be manually extracted, like amounts and dates. The data was OCRed, but it wasn’t great for a lot of fields. Much of the data needed to be cleaned, but that’s probably the extent of weird issues.

Andrew Tran: I used the tabulizer package in R to go through the sheets and scrape the data. It was as if the whole PDF was optimized to be as difficult as possible for journalists to parse. Things that would result in the data not getting scraped: Sometimes pages of the PDF would be slightly shifted so if my scraper was looking for data in a specific area of the page, nothing would get picked up. And sometimes what was picked up just couldn’t be interpreted correctly. Numbers would have letters mixed in. Spaces between words disappeared. I’m surprised there wasn’t spilled coffee stains on any of the pages. What wasn’t picked up by my script, I filled in manually.

Derek Willis: I added a couple of things. One was the page number of the PDF each record appeared on, because that can be useful, and that’s a simple thing to add. Each form is just three transactions per page, so you just count. One thing is when the FEC does get around to typing things is they use an entity type—whether a donor is an individual or entity or whatever. Most are individuals so I went in and marked all individual and then marked the non-individuals as organizations. I also wanted to identify the members of congress. They were likely buying tickets for their constituents and other folks and lastly, I did the day.

Steven or someone had already done the month and the year [for each filing], and unfortunately some things you could get by OCR but not the days. So I blew through those—I did about a third of them that night and then the next morning on my commute on the Metro, I pounded out the rest.

Rachel Shorey: At the New York Times, I’m responsible for extracting and processing campaign finance data for regular deadlines. We’ve got a lot of tools internally to deal with the campaign filings, but the inaugural committees only file once every four years, so we hadn’t done a lot of prep work for those, and we knew it’d be custom. We expected a relatively easy-to-parse CSV file based on the past two inaugurals. But when the file came as a scanned PDF, I decided that the fastest approach was to read all 510 pages and manually enter the biggest donors into a spreadsheet so we could get some quick aggregates and get reporters’ eyes on it.

When Steven started putting the donor list on GitHub, I realized it’d be easy to correct a few OCR errors I had noticed while hand entering the data internally (one of which was on a million dollar donor’s name) so I submitted a quick pull request, which I continued doing as I noticed other issues as part of our reporting. The only “technical” part I contributed is writing a script to fix zip code formatting, which kept getting messed up when people would edit the data in Excel. We used the sheet as a backup check on some of the numbers I had hand entered. For example, I searched Steven’s doc for “$1,000,000” to make sure I hadn’t missed any million-dollar donors in my own count.

Source: When and how did you release the extracted data?

Steven Rich: I released the data on GitHub as soon as I had three fields completed and (relatively) clean. From there, I and others continued to add missing fields and clean. Once it was up, I added things like month, year, and notes. Rachel Shorey of the New York Times helped clean and standardize data. Derek Willis of ProPublica helped to add days, entity types and page references. Andrew Tran of the Connecticut Mirror (and soon to be my colleague at the Post) finished by adding address information. Chris Zubak-Skees of the Center for Public Integrity and Justin Reese, formerly of DocumentCloud, have since come in and helped to clean the data.

Source: I know Christina Wilkie is crowdsourcing research on the individual donors—do you know of other reporters or projects using it as well?

Steven Rich: I have seen some other folks visualize the data but I think what Christina is doing is by far the biggest thing being done with this data. I helped her get the ball rolling with this and am excited to see how well it’s done. It was a great idea on her part.

Andrew Tran: I pulled out the information on donors who said they were from Connecticut and sent it to the Mirror’s political reporter to dig into. I’m not sure if other local organizations are doing something similar but I hope they are. That was the whole point.

Derek Willis: My sister sent me an article from the Willamette Week which mentioned that the owner of the—it was originally a piece on the Intercept, and the Willamette Week expanded on it—a local hotel owner in Portland was tied to four donations of $250,000 through a network of LLCs. And this person had repudiated the message around the [Trump] campaign, but then they gave a million dollars to the inaugural.

The one that really caught my eye was the Katherine Johnson one. Had I been doing data entry of the whole thing, of just the names, I wouldn’t have given it a second thought—if I’d done the name and the address, maybe it would have come up, but probably not, it’s the next person to look at it who might be like “wow!” It underscores the value of having multiple eyeballs on all of these things. I mean, all of us looked at that record, and it didn’t strike a chord with anybody—until it did.





Current page