Chase Davis on fec-standardizer
Machine learning + campaign finance standardization
Last week, Chase Davis launched a public experiment called fec-standardizer—a machine-learning project designed to find out whether the manual process of campaign donor standardization and de-duplication might be automated. Davis has published his code and documentation on GitHub, including detailed walk-throughs of every part of his process to date. On Friday evening, he spoke with us about the project’s origins and future, and the challenge of scaling human intuition.
The Origin Story
Q. What led you to take on this project, and to document it so thoroughly?
I’ve covered local, state and federal politics, and the problem of donor de-duplication has always frustrated me. Looking at campaign finance data from the donor level rather than the contribution level is just so much more useful—plus, because it’s so damn hard to standardize the data, it offers a lot of new ground to explore. There are a ton of applications for donor-level analysis we haven’t even thought of yet.
The specific motivation came in part from talking to folks like Derek Willis, Chris Groskopf and Matt Cutts at NewsFoo last month, where we all generally agreed this was a problem that needed to be solved at scale. I’ve been playing around with machine learning for a few years, and I’ve been lucky enough to meet a bunch of smart data scientists out in San Francisco, so I’m always looking for excuses to play with those skills in the real world. It also helps that I left my day job a few weeks ago, so I’ve got some extra time on my hands.
As for the documentation, it started as notes to myself so I wouldn’t forget what the hell I was doing. But I’m also a big believer that data science tools and techniques will be huge in advancing data journalism, so I wanted to provide a resource for anyone who was interested.
What It Does
Q. If fec-standardizer can replace humans in the work of donor standardization, what will that mean, in a practical sense? What are we gaining?
One of the really interesting things to me about this project is that it’s basically modeling the intelligence of human beings in order to do its job. Whatever secret sauce CRP uses to de-duplicate their donors—human, machine or otherwise—is essentially what’s being learned by these algorithms. It’s piggybacking on their years of experience, which I think is awesome. Algorithms bring human intuition to scale. We should be looking for more opportunities to do that in journalism.
That said, people still do a better job than machines at making even the simplest decisions. The idea of bringing artificial intelligence into the mix is that it can take care of the easy stuff and leave the harder stuff for humans with actual brains. That’s part of why I wanted to use a classifier that makes probabilistic judgments. When it’s confident, it’s usually right. But when it’s not, you want people to step in and make the call. In that way, these kinds of tasks are interface problems as much as data science problems. The trick is knowing where to draw the line to maximize the strengths on each side.
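To make the idea of drawing that line concrete, here is a minimal sketch of confidence-based triage. It assumes a classifier has already assigned each candidate donor pair a match probability. The function name `triage`, the two thresholds and the sample names and scores are illustrative assumptions, not taken from fec-standardizer.

```python
def triage(scored_pairs, match_at=0.9, reject_at=0.1):
    """Split scored pairs into auto-matches, auto-rejects and human review.

    scored_pairs: iterable of (pair, probability) tuples, where probability
    is the classifier's estimate that both records are the same donor.
    The thresholds are hypothetical and would need tuning on real data.
    """
    matches, rejects, review = [], [], []
    for pair, prob in scored_pairs:
        if prob >= match_at:
            matches.append(pair)      # confident match: accept automatically
        elif prob <= reject_at:
            rejects.append(pair)      # confident non-match: discard automatically
        else:
            review.append(pair)       # uncertain: leave for a person to decide
    return matches, rejects, review


# Made-up example records and scores, for illustration only.
scored = [
    (("SMITH, JOHN", "SMITH, JOHN A"), 0.97),
    (("SMITH, JOHN", "SMYTHE, JOAN"), 0.04),
    (("LEE, ROBERT", "LEE, BOB"), 0.55),
]
matches, rejects, review = triage(scored)
```

Moving the two thresholds closer together sends fewer pairs to human review at the cost of more automated mistakes; moving them apart does the reverse.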
Where It’s Going
Q. Are you looking for help with anything in particular as you head into optimizing and generalizing the project?
At this point, I’m mostly interested in applications. What kind of cool stuff would people like to do with donor data that we haven’t seen before? I mentioned in the project writeup that we hosted a data mining contest last year with IRE and Kaggle, which is a data mining challenge company based in San Francisco. The idea was to see what professional data scientists would do with a dataset that journalists look at every day. We got some really cool responses back, which you can read more about here. We’re also hosting the winner and a few other entrants on a panel at the NICAR conference in Louisville next month.
Shameless plug aside, the point is I sometimes think coming up with truly useful applications for some of these whiz-bang technologies is harder than actually building the technologies themselves. So if this tool opens any doors for folks to try out new and interesting things, give a holler.
Q. What kinds of prerequisites would be helpful to people who want to use fec-standardizer?
At this point, the project isn’t really designed for public consumption, so to speak. It’s set up more like an experiment than a tool. My goal was to see whether an automated approach was even feasible for a task like this. Preliminary results suggest it probably is. The next step is to turn it into a tool that generalizes across any campaign finance dataset out there—local, state or federal—and offers some more options for customization.
But that shouldn’t stop people from playing around! I tried to keep the documentation as plain-English as possible, but a vague understanding of things like machine learning and graphs wouldn’t hurt.
Q. How about tech/environment setup? What would we need to dig in?
Not much. The app is built in Python and Django. It runs by default off a SQLite database to make the setup easy, although you’ll want to upgrade to Postgres or MySQL with larger datasets. All the required packages are in the repository’s requirements.txt file.
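For readers unfamiliar with Django, switching databases is a settings change. The fragment below is a sketch of what that might look like, not the project's actual settings file: the database names, user, and the `FEC_DB` and `FEC_DB_PASSWORD` environment variables are all placeholders. Older Django releases name the Postgres backend `django.db.backends.postgresql_psycopg2`.

```python
import os

# Default: a single-file SQLite database, which needs no server setup.
# The file name is a placeholder, not the project's real one.
SQLITE = {
    "ENGINE": "django.db.backends.sqlite3",
    "NAME": "fec.sqlite3",
}

# Postgres alternative for larger datasets. Every value here is a placeholder.
POSTGRES = {
    "ENGINE": "django.db.backends.postgresql",
    "NAME": "fec",
    "USER": "fec_user",
    "PASSWORD": os.environ.get("FEC_DB_PASSWORD", ""),
    "HOST": "localhost",
    "PORT": "5432",
}

# Pick the backend with a hypothetical FEC_DB environment variable.
use_postgres = os.environ.get("FEC_DB") == "postgres"
DATABASES = {"default": POSTGRES if use_postgres else SQLITE}
```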
Be warned that some of the requirements are a pain to get running on OS X. I’m thinking of Matplotlib and SciPy in particular. Here are a couple of resources I found useful in dealing with those problems:
Challenges & Rabbit Holes
Q. Did you encounter any unexpected challenges during the development process?
I didn’t really come in with much of a plan or a vision for what the final product would look like, so it would almost be wrong to call anything unexpected. But I was definitely surprised at where some of the work took me.
For example, I didn’t use it much in the final product, but I spent a good amount of time early on looking into an algorithm called locality-sensitive hashing, which can be used to cluster things like text in linear time. Turns out it didn’t add a lot to this project, but I can think of a bunch of other things it might be useful for: finding plagiarism or duplicate text at scale, for example.
I probably spent a few days running down that rabbit hole, learning about the details and trying to implement it with mixed results. It didn’t pay off at the time, but now I’m working on another project that features it front and center. So the lesson is, run down rabbit holes.
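For anyone curious what locality-sensitive hashing looks like in practice, here is a small sketch of one common variant, MinHash with banding, applied to near-duplicate text. It is not the code from fec-standardizer. The function names, shingle size, number of hashes and band count are assumptions chosen for illustration, and the sample documents are made up.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of overlapping k-character substrings of the text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, varied by seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=32):
    """For each seeded hash function, keep the minimum hash over all tokens."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def lsh_candidates(docs, num_hashes=32, bands=16):
    """Return pairs of document ids that share at least one signature band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            # Documents whose signatures agree on a whole band land together.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "campaign finance data is hard to standardize",
}
candidates = lsh_candidates(docs)
```

Each document is hashed once, so the bucketing step grows linearly with the number of documents. Only documents that collide in a bucket are compared pairwise, which is what keeps the approach practical at scale as long as the buckets stay small. Identical texts always collide; similar texts collide with a probability that rises with their overlap.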
Q. Any advice to developers or journalists who want to take on similar projects?
Learn math, specifically linear algebra and statistics. My math-fu is weak at best, but you can do a lot if you know enough to be dangerous.