Databae, Better Bots, and the Automation We Need Right Now
Be smarter about software and share your imperfect code
In journalism, we report, we write, and then we rinse and repeat. Day after day we run through the same motions the way we always have. Our competitors do the same.
It can be a bit mind-numbing—and it wastes time that could be better spent talking to people, crunching data, and perfecting our writing. There are certain aspects of our jobs that we do daily and many that we do many times a day. So we say, “why not automate?”
We joke from time to time that we’re attempting to automate ourselves out of our jobs, and that certainly is partially true. But there are certain aspects of our jobs that we could never automate. We can’t create a bot that embeds itself in the lives of others like our colleague Eli Saslow and writes beautiful opuses on the human experience. But automation can make our lives easier, and it can make the lives of the reporters we work with easier.
The first thing that comes to mind when we think about automating is something like the Los Angeles Times’ Quakebot (which has been written about at length and which we think is awesome). But most automation in news, we believe, needs to happen behind the scenes. It’s not the kind of automation that writes our stories; it’s the kind of automation that alerts us to stories. It’s the kind of automation that compiles data at regular intervals. And it’s the kind of automation that simplifies some writing.
We’re at an interesting moment in American history. We don’t need to tell you that. It can be quite overwhelming if you let it consume you. It’s certainly come close to consuming us from time to time. But we’ve realized that one key to sanity lies in building the tools you need to allow you—and the reporters you work with—to stop worrying about the small stuff.
We also realized that we were not the only ones dealing with this issue, so when the opportunity came to pitch a session on this for SRCCON, we did not hesitate. Our session, entitled “Practical Approaches for Creating Software to Cover Democracy,” was designed so we could get a bunch of news nerds in a room and discuss the software we’ve built in the Trump era, but also to brainstorm what more we could be doing.
A Tale of Two Stories
There are two approaches to using data and code to tell these stories:
Writing software to alert you to stories that don’t exist (yet), or;
Creating a dataset by hand and then using software to analyze and present a story.
In July, during the White House’s “Made in America” week, we wanted to know how many foreign workers the Trump Organization had hired for Trump’s Mar-a-Lago Beach Club. We began scraping H-2B visa filings ahead of time in case a story like this came about. And it did. The Trump Organization applied to hire 70 foreign workers in the midst of a presidential campaign devoted to hiring and creating in the United States. Scraping the data ahead of time allowed us to move and publish quickly.
While writing software to scrape and alert the newsroom of possible story ideas is fantastic, it can only go so far. It’s often hard to collect data ahead of a story you’ve started to piece together, but even when you know what data you’ll need, sometimes it’s just easier to collect the data using humans and a spreadsheet.
We did just that when we wanted to track Trump’s claims during his first 100 days. While we could have written sophisticated software to track his speeches and used something like natural language processing to analyze the text, it felt better (and easier) to have our Fact Checker team keep track of statements in a Google spreadsheet and then analyze from there.
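Once the hand-curated sheet exists, the analysis side can stay just as simple: export it as CSV and count. A sketch, assuming hypothetical `date` and `claim` columns (adjust to whatever headers your spreadsheet actually uses):

```python
import csv
import io
from collections import Counter

def claims_per_day(csv_text):
    """Count fact-checked claims by date from a hand-curated sheet.

    The date/claim column names are hypothetical; match them to
    your spreadsheet's real headers before running this.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["date"] for row in reader)

# Hypothetical rows standing in for a Google Sheets CSV export.
sample = """date,claim
2017-01-21,Claim about crowd size
2017-01-21,Claim about voter fraud
2017-01-23,Claim about jobs
"""

counts = claims_per_day(sample)
print(counts.most_common(1))  # busiest day first
```

The point is that the hard part was the human curation; the code on top of it can be a few lines of standard library.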
Writing software that alerts us to possible stories mixed with human-curated datasets allows us to write and report stories we’d otherwise miss. This nexus of data reporting has aided us in all facets of our reporting work.
The truth is, while President Trump dominates the news these days, these tools are just as useful, if not more so, for state, county, and municipal governments everywhere. Local officials may not have the vast business ties the president has, but they arguably directly affect more lives in the communities that they serve than their federal counterparts.
Don’t Repeat Ourselves
Another truth to come out of our session was that a lot of us are literally writing the same software to do the same stuff, and it’s wasting a lot of time that could otherwise be spent reporting or building more unique tools.
We are not the only ones to have built an alerts system for when Trump Organization businesses file for foreign worker visas. We are not the only ones attempting to scrape the personal financial disclosures of members of the Trump administration (which come in PDF format but are at least machine-readable). And we are not the only ones attempting to map out the president’s business connections for a better understanding of how he operates.
Great initiatives like OpenElections and the California Civic Data Coalition bring together reporters across different newsrooms to improve access to specific kinds of data. But often the software we write doesn’t merit as large an undertaking. Often we write so-called “snowflake” scripts—bespoke, niche tools that get us to our deadlines. But we think there’s another way.
The news business is a competitive one, to be sure. We at the Washington Post sure as hell want to land a story before the folks over at the New York Times. (Seriously, it’s a great feeling. We’re positive they feel the same way.) But the truth is that we’re all building the same house, and it makes no sense for us all to build our own master bedroom. That’s a lot of master bedrooms when you really only need one.
It’s imperative that we work to shed the stupidity that is this competitive repetition. Why should we all be writing the same damn code to do the same damn thing? We shouldn’t. Which is why, as part of this session, we created a GitHub organization and a repository so that we can open source more of the code we’re writing that we think will be useful.
Announcing Databae: An Open-Source Initiative to Share One-Off Newsroom Code
Much of the code that we commit to the repo, at least initially, will be related to the federal government and President Trump. The two of us have been doing a lot of work on the subject this year and so that’s what our code is. But we’ll also add some local and regional software, like a scraper to pull cases from D.C.’s new and improved Superior Court site.
In that spirit, we created an organization called Databae. (Steven already owned the domain name and this seemed like a fitting application.) The goal is to collect and publish single-serving scripts that others might be able to benefit from. For example, a script that will help you scrape the D.C. court system and another that’ll let you query the FEC API by contributor name and by year. Chances are if you’re reading this that you’ve written similar code. And that’s great. But we think by sharing these tools, we can create an ecosystem where we can spend less time on StackOverflow and old Gists and more time reporting stories.
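As an example of the FEC piece, most of the work is just assembling the right query against the FEC’s public API. Here’s a minimal sketch that builds a `schedule_a` (itemized contributions) request URL by contributor name and cycle; the parameter names follow the public FEC API documentation, but verify them against the current docs before relying on this, and swap in your own API key:

```python
from urllib.parse import urlencode

# Public FEC API endpoint for itemized individual contributions.
BASE = "https://api.open.fec.gov/v1/schedules/schedule_a/"

def fec_contributions_url(name, cycle, api_key="DEMO_KEY"):
    """Build a query URL for contributions by donor name and cycle.

    two_year_transaction_period is the even-numbered election-cycle
    year (e.g., 2016 covers 2015-2016). Parameter names are taken
    from the FEC API docs; double-check them before use.
    """
    params = {
        "contributor_name": name,
        "two_year_transaction_period": cycle,
        "api_key": api_key,
        "per_page": 100,
    }
    return BASE + "?" + urlencode(params)

url = fec_contributions_url("Smith, Jane", 2016)
print(url)
```

From there it’s one `requests.get(url).json()` call and a loop over the `results` list—short enough that sharing it saves everyone the same half-day of reading API docs.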
Here’s where you come in. Our DC court case scraper is great for us, but a repository of every state’s court systems could be useful for a lot of folks. This is why we’re encouraging you to help us build one of the biggest repositories of useful software for democracy.
Now you may think your code is ugly and sharing it would reveal you as a fraud. We know you feel that way because we have felt that way a lot. But the truth is that if your code works, it’s good code. No one is here to judge. We will love you if you commit and we can help you if you need it.
At conferences like SRCCON, we’ve heard more and more about collaborative journalism. While we may not always collaborate à la Electionland or the Panama Papers, there’s no reason we can’t scratch each other’s backs by open-sourcing our code and allowing us to get back to the stuff we have the most fun doing.
Final Thoughts on Creating Software to Cover Democracy
Working as a data journalist means juggling writing stories and writing software. And even the most seasoned of us still have to sometimes re-remember how we wrote a script (documentation is key!) or, more often, why. Thus, we came up with seven guiding principles that we think make it easier to manage these tasks:
Cron is your friend: Cron jobs let you run code on a server at regular intervals. This is much easier than setting an alarm on your phone to remind you to run a script from your work laptop at some awful hour.
Ugly code is good code if it does what you intend it to do: While it’s smart to write code that can go into production, often the scripts we write only need to do one thing (and not always well!). Unless your code has some sort of public-facing portion, focus on getting the results you want so that you can spend more time reporting.
Don’t be afraid to think outside the box on data sources: We often gather data from APIs, open records requests, and databases, but data exists beyond that. For example, statements made on television by local politicians aren’t often collected into a single dataset. Collecting this data for the first time could lead to interesting stories and future reporting targets.
You don’t always need code: Picking up a phone may save you an entire day or two of work.
Use your newsroom: Even if you’re a lonely coder in a newsroom, chances are some reporter has already tried to look at the data you’re looking at or at the very least has heard about it.
The data may not be in one place: Much data that is collected across the country isn’t aggregated at the federal level. Pulling and combining data reported to state agencies can set your work apart.
If possible, open source: We’re all in this together, so if you solved a hard problem, or gathered data that seems relevant beyond your immediate newsroom, why not share it with the world? You can always publish your code or data at https://github.com/databae-org/vulcan.
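The first principle above boils down to a single line in a crontab. A sketch, with a hypothetical path, schedule, and log file—adapt all three to your server:

```
# min hour day month weekday  command
# Run the visa-filings scraper at 15 past every hour, appending output to a log.
15 * * * * /usr/bin/python3 /home/newsroom/scrapers/visa_filings.py >> /var/log/visa_scraper.log 2>&1
```

Edit it with `crontab -e` on most Unix-like servers; the `>> … 2>&1` redirection keeps a running log so you can check later that the job actually fired.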
The news won’t stop and neither will the deadlines. It’s key to have tools and systems in place that allow us to do our best work without thinking about the specifics. Document your work and share it with your community. The more we share our work and data, the more we can focus on telling the best stories about our communities. And hopefully we’ll make democracy a little better, too.
Steven Rich is the database editor for investigations at the Washington Post. He’s also a board member of Investigative Reporters and Editors. He’s a (fairly recent) grad of Virginia Tech and Mizzou.
Aaron Williams is a data journalist, analyst, and visualization expert tackling inequity in data and design at scale. He’s currently a senior visualization engineer in Netflix’s Data Science and Engineering group and previously spent a decade as a data and graphics reporter—most recently at the Washington Post.