Invest in Trust and Make Projects Reproducible by Sharing Your Data Analysis
Here’s what we tried at the St. Louis Post-Dispatch to open up our process in the spirit of transparency.
Journalists ask for transparency from sources and public officials. The public has a right to know how their government works, who their elected officials are, and how tax money is spent.
But it’s also important for journalists to be transparent with their readers, and, whenever possible, give them information (metadata, if you will) about where our reporting came from.
We want them to know we are honest people who are trying to get the most current and accurate information we can at the time of reporting, and that we’re trying to give it to them as quickly as possible.
Sharing a data analysis is one tactic to open up your process and improve transparency. Showing your process to your audience, so long as you won’t burn any sources, is a great way to earn readers’ trust. It’s also a pathway to starting a conversation. Journalists are used to getting angry emails and tweets. A conversation with readers doesn’t always have to be hostile, or even a direct exchange on one of those platforms. Publishing your data analysis is a form of conversation without the comment-thread rabbit holes that lead to less understanding, rather than more.
For all these reasons, I published the data analysis for a St. Louis Post-Dispatch story earlier this year. It was about unsafe parks and playgrounds, and an understaffed parks department. Using Python and a tool called Jupyter Notebook, we showed where we got data and how we “sliced and diced” it.
How We Got Started
Long before we shared our first Jupyter notebook, our three-person interactive team at the Post-Dispatch had been talking about creating a reproducible data analysis and publishing it. A good data journalist always presents her analysis and conclusion to sources, and asks for confirmation of the results.
But sharing the behind-the-scenes with our readers was something new. We just needed the right story to try it on. Although publishing a reproducible data analysis is relatively user-friendly, deciding on an appropriate story turned out to be somewhat complex.
For example, I knew I wanted the first Notebook we published to be simple and straightforward, and I wanted it to use data I was already familiar with. At the same time, I also wanted it to be a bigger, longer story, preferably with a chance for impact. As one of three people on the newsroom interactive team, I was conscious that I couldn’t spend a week publishing something overly simple, such as code that only sorted, filtered and summed. I’d have to justify it, not just to my bosses, but to myself.
Choosing the Right Story to Share
The story we eventually chose was by reporter Jesse Bogan, about a five-year-old girl who’d had her head cracked open by a 200-pound steel door at a St. Louis park. The door had come loose from its hinges, prompting broader questions about park safety and staffing. Bogan asked me if I could learn how many complaints the city had received, within the last year, about any of its 109 city parks.
A Jupyter notebook seemed like a great choice for several reasons.
If Bogan or any other reporters revisited the story in a year or even ten, we could re-run the data easily. Plus it featured a strong news peg and a well-stocked city government open data portal. I was also already familiar with the data, and I knew what the column headers meant and how the records were created.
I set up a Jupyter notebook, and got to work.
Here’s what we ultimately published.
Your Turn: A Few Criteria for Choosing Where to Start Publishing Your Data Analysis
If you’re thinking about sharing your data analysis and workflow using a Jupyter notebook, here are some ways to think about choosing the right story for your first effort. A story might be right if…
The data is public.
For the St. Louis parks story, anyone can access the raw database as I downloaded it, filter to the appropriate dates, and run it through the notebook if they want to double check the numbers in the story. I like to think of the type of analysis and presentation we used as the point of open data: it is open so we can all share it and learn from it. Sharing it feels okay from the business side, too. Because it’s public data, we didn’t spend a lot of staff time requesting the data or negotiating for it. The city of St. Louis posts the Citizens Service Bureau data online for free. It’s essentially customer service for residents of St. Louis, and includes records for each complaint called, tweeted or emailed in. They’re often referred to as 311 calls.
It’s not an unmanageable amount of data.
The data won’t take up considerable space on GitHub. It should be fairly easy to download and open on a personal computer, provided the person is familiar with .zip files and Microsoft Excel.
You’ve done an actual analysis.
This isn’t us cleaning public data and presenting it in a more digestible way, as we do with our salary and education guides. For those guides, there’s no analysis involved—not much going on “under the hood” that would interest anyone.
For the parks story, we needed to look at calls from concerned citizens about safety or other issues at parks. There were several codes for problems at parks in the 311 data, and we had information from interviews about what other codes to look for.
First, I focused on complaints about playground equipment specifically. I totaled all playground equipment complaints from 2009 through the beginning of May, which was the most current data available. I grouped the complaints by year to see if one year had a lot more complaints than the others; this could potentially indicate dirty data. It was one way of checking that a single incident or piece of playground equipment wasn’t generating a disproportionate number of calls. There is still one caveat to this data, which we note in the story: it only reflects what’s called in. When grouped by year, there were no extreme outliers, which made me feel comfortable proceeding with the playground equipment complaint data.
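The year-by-year check described above could look something like this in pandas. It is a minimal sketch, not the published notebook: the column names (`date_opened`, `problem_code`), the code label and the rows are all made up for illustration, and the real 311 export will use its own headers.

```python
import pandas as pd

# Toy rows standing in for the 311 export. Column names, the problem code
# label and the dates are hypothetical, not the city's real schema or data.
calls = pd.DataFrame(
    {
        "date_opened": [
            "2015-03-02", "2015-07-19", "2016-04-11",
            "2016-05-30", "2016-09-01", "2017-02-14",
        ],
        "problem_code": [
            "PLAYGROUND EQUIPMENT", "HIGH GRASS", "PLAYGROUND EQUIPMENT",
            "PLAYGROUND EQUIPMENT", "PARK RESTROOM", "PLAYGROUND EQUIPMENT",
        ],
    }
)
calls["date_opened"] = pd.to_datetime(calls["date_opened"])


def complaints_per_year(df, codes):
    """Count complaints whose problem code is in `codes`, by calendar year."""
    matching = df[df["problem_code"].isin(codes)]
    return matching.groupby(matching["date_opened"].dt.year).size()


playground_by_year = complaints_per_year(calls, ["PLAYGROUND EQUIPMENT"])

# Compare each year to the median year; a very large ratio would be a cue
# to look for one incident generating many calls, or for dirty data.
ratio_to_median = playground_by_year / playground_by_year.median()

print(playground_by_year)
print(ratio_to_median)
```

Where to draw the line for "a lot more complaints than the others" is a judgment call; the ratio only makes the comparison easier to see.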
Bogan’s story was, more widely, about a shortage in parks staff, and the impact of that on parks, so we included codes like high grass, problems with park restrooms, and trash in the park.
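Widening the analysis from one problem code to several is mostly a matter of filtering against a list of codes. The sketch below assumes a `problem_code` column and invents both the code labels and the rows; the actual labels in the city's data will differ.

```python
import pandas as pd

# Hypothetical labels for park-related problem codes; the real 311 data
# uses its own code names.
PARK_CODES = {"PLAYGROUND EQUIPMENT", "HIGH GRASS", "PARK RESTROOM", "PARK TRASH"}

# Toy rows for illustration only, mixing park and non-park complaints.
calls = pd.DataFrame(
    {
        "problem_code": [
            "HIGH GRASS", "POTHOLE", "PARK TRASH", "HIGH GRASS",
            "PARK RESTROOM", "PLAYGROUND EQUIPMENT", "STRAY ANIMAL",
        ]
    }
)

# Keep only the park-related complaints, then tally them by code.
park_calls = calls[calls["problem_code"].isin(PARK_CODES)]
counts_by_code = park_calls["problem_code"].value_counts()

print(counts_by_code)
```

Keeping the list of codes in one named place makes it easy for a reader to see, and question, which complaints were counted as park problems.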
There’s a story.
Sometimes the story starts in the data. Maybe I spot a trend in the 311 data one day when I’m hunting for story ideas. Maybe the fact that complaints about playground equipment remained steady for a few years, then ticked up a bit, inspires me to approach a reporter with the potential story, or to do a story myself. Even if we do see a trend, however, there may not be a story. We can see a certain trend is developing, but can we find the people to speak about what’s happening? This is my favorite challenge of the job.
Other times, a reporter is already pursuing a story related to the data. That’s the way the parks story happened. Bogan had a story already, and he asked me if there was any data out there regarding complaints at parks. Finally, some nerdy preparation paid off, and I knew exactly where the data was. I’d worked with the data before when we’d examined illegal dumping and other trash issues in St. Louis.
The technical side isn’t too technical.
This isn’t to say we won’t publish more complex analysis or larger datasets. But at least for the first Jupyter Notebook we published, I was really hoping for a notebook that didn’t contain hundreds of lines of code. I wanted to publish the code in a way that a person who isn’t familiar with Python or any programming whatsoever may at least be intrigued — and not overwhelmed — by the analysis.
Here’s the whole reason that I love data journalism as much as I do: We have the ability to shed light on issues that would otherwise remain hidden, whether that’s because powerful people didn’t look for them, didn’t have the technological know-how to look for them, or just plain didn’t want to look for them.
Even though I thought few people would want to read or look at computer code, our willingness to put code out there for anyone to see is an indicator that we as journalists have nothing to hide.
Reporting the news is a time-intensive job, but finding time to create more transparency is an investment in reader trust. I can’t think of anything more important.