To Scrape, Perchance to Tweet
How we made the Chicago Tribune’s Illinois Campaign Cash data scraper
Yea, these are the glories given unto us by the Illinois State Board of Elections, in its infinite wisdom and majesty. Not to publish an API, nor to frequently alter the DOM structure of the campaign contribution pages that contain the data we seek. For us, one path is left open, and we hesitate to skip down it but skip we must, for none have before us and none should have to after we are done with our work.
At the Chicago Tribune, we had a simple goal: to automatically tweet contributions to Illinois politicians of $1,000 or more, which campaigns are required to report within five business days. We wanted to see, in something approximating real time, which campaigns are bringing in the big bucks and who those big-buck-bearers are. The Illinois State Board of Elections (ISBE) has helpfully published exactly this data online for years, in a format that appears to have changed very little since at least the mid-2000s. There’s no API for this data, but the stability of the format is encouraging. A scraper is hardly an ideal tool for anything intended to last for a while and produce public-facing data, but if we can count on the format of the page not to change much over at least the next several months, it’s probably worth it.
Scraping Is Fragile
Of course, when dealing with any scraping-and-parsing system, nothing is forever. The format of the pages could change completely tomorrow, and we’ll have no choice but to adapt our code. That’s something we would ordinarily prefer to avoid, but from time to time—and this is one of those times—there’s no alternative way to get the data we need.
We want this project to be useful for a while, and we hope that others are able to use our code to build interesting and useful projects on top of this important data. We’re committed to maintaining and expanding our code, since we’ll need to for our own benefit, at the very least, and we might as well share the products of that effort with the community. For the moment, though, it’s worth keeping in mind that, on top of the inherent instabilities in relying on a scraping system, the code is at a pretty early and unstable phase of life, so caveat forktor.
When we got started, we intended to open-source the code at some point in the distant future, but we began by just getting a proof of concept working, to show our editors and determine whether the project even made sense to pursue. In the beginning, the parsing code was tightly coupled to a database and an auto-emailing rig, because it was initially quicker to throw everything together. One of the great benefits of open-sourcing code is that it can enforce good engineering habits, like modularization and separation of dependencies, and so our code now is cleaner in large part because we decided to put it out in the open.
There are a lot of web-parsing options out there, but for this, we took advantage of Python’s wonderful BeautifulSoup library. We’re not doing anything too complex here, and the ISBE contribution pages stick to pretty rigid table structures with descriptive class names, so we can make a lot of simplifying assumptions that help keep the code clean. One of the biggest problems encountered while parsing webpages is that the code is so reliant on magic numbers and hyper-specific patterns that it can be very difficult for someone else to read, understand and maintain. We wanted to avoid that here; you can take a look at the code and judge for yourself how well we did.
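To illustrate the kind of parsing that rigid tables with descriptive class names allow, here is a minimal sketch using BeautifulSoup. It is not the Tribune's actual code: the sample markup, the class names (`contribution`, `contributor`, `amount`, `committee`) and the function names are all invented for illustration, and the real ISBE pages will differ.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

# Hypothetical markup: class names and layout are assumptions,
# not the real structure of ISBE's contribution pages.
SAMPLE_HTML = """
<table class="contributions">
  <tr><th>Contributor</th><th>Amount</th><th>Committee</th></tr>
  <tr class="contribution">
    <td class="contributor">Jane Q. Donor</td>
    <td class="amount">$25,000.00</td>
    <td class="committee">Friends of So-and-So</td>
  </tr>
  <tr class="contribution">
    <td class="contributor">Acme PAC</td>
    <td class="amount">$1,500.00</td>
    <td class="committee">Citizens for Example</td>
  </tr>
</table>
"""


def parse_amount(text):
    """Turn a string like '$25,000.00' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def parse_contributions(html):
    """Return one dict per table row tagged with the 'contribution' class."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_="contribution"):
        # Looking cells up by class name avoids positional magic numbers.
        rows.append({
            "contributor": tr.find("td", class_="contributor").get_text(strip=True),
            "amount": parse_amount(tr.find("td", class_="amount").get_text()),
            "committee": tr.find("td", class_="committee").get_text(strip=True),
        })
    return rows
```

Selecting cells by class rather than by column index is what keeps a parser like this readable: if a column is added or reordered, the lookups still say what they mean.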
Once we had a parser basically working for the report type we cared about, all that remained was to hook it up to Twitter. The tweepy library makes this part pretty simple as well, and from a technical standpoint there’s not much to it. However, fitting everything we needed into 140 characters turned out to be delightfully challenging.
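As a rough sketch of what the Twitter hookup could look like, the snippet below authenticates with tweepy and posts a list of messages. It assumes tweepy's OAuth 1 interface (`OAuthHandler`, `API.update_status`), which may be unavailable or restricted depending on your tweepy version and Twitter API access. The function names, the `dry_run` flag and the credential parameters are hypothetical, not taken from the project.

```python
def make_twitter_api(consumer_key, consumer_secret, access_token, access_secret):
    """Build an authenticated tweepy client. All four credentials are placeholders."""
    import tweepy  # imported lazily so post_tweets can be exercised without tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    return tweepy.API(auth)


def post_tweets(api, messages, dry_run=False):
    """Send each message via api.update_status and return the list of messages.

    `api` can be any object with an update_status(status=...) method,
    which makes it easy to substitute a stand-in while testing.
    With dry_run=True nothing is sent.
    """
    handled = []
    for text in messages:
        if not dry_run:
            api.update_status(status=text)
        handled.append(text)
    return handled
```

Keeping the posting step separate from the parsing step, and passing the client in as an argument, is one way to avoid the tight coupling described earlier.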
Getting to the Tweet
One of the great things about Twitter, as with many linguistic forms that impose hard constraints (iambic pentameter, haiku, etc.), is the entertaining contortion involved in fitting chaotic and sprawling thoughts into a rigid structure. The challenge for an automated Twitter account is therefore computationally poetic. In our case, some committee or contributor names are much longer than others. There’s some variation too in dollar amounts and URL sizes. We also wanted to include a way to highlight specific, large individual donations, which exposed us to further variance in name lengths.
In the end, our solution was a bit blunt, but effective. We assigned up to 117 characters to the summary text (“$25,000 from 12 contribs to so-and-so”), truncating everything after 117. Since the last, and most variable-length, field was the name of the campaign committee receiving the money, odds are that if we truncate it a bit, it’ll still make sense; whereas truncating a dollar amount or a number of contributions would distort the meaning of the tweet, and possibly confuse users. We stop at 117 rather than 140 in order to allow room in the tweet for a URL to point to the report itself on ISBE’s website, so anyone who’s interested can see firsthand what the data says.
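The truncation rule described above can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the function names and the exact template string are hypothetical, and how Twitter counts the appended URL toward the limit is outside the scope of the sketch.

```python
SUMMARY_LIMIT = 117  # characters reserved for the summary text, per the article


def build_summary(total, count, committee, limit=SUMMARY_LIMIT):
    """Format the summary and cut it off at `limit` characters.

    The committee name comes last, so it is the only field
    that truncation can clip.
    """
    text = f"${total:,} from {count} contribs to {committee}"
    return text[:limit]


def build_tweet(total, count, committee, url):
    """Append the report URL after the (possibly truncated) summary."""
    return build_summary(total, count, committee) + " " + url
```

For example, `build_summary(25000, 12, "so-and-so")` returns `$25,000 from 12 contribs to so-and-so`, while a very long committee name is clipped so that the summary never exceeds 117 characters and the dollar amount and contribution count stay intact.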
How Is It Working?
We’re very interested to see how well this works, and whether it’s useful or helpful for reporters, citizens and anyone else interested in campaign finance. Let us know what you think.