How coding can change the very journalism we do
From faster, replicable work to multi-story databases, here’s what I’m most excited about after my fellowship with a data team
This year, I spent ten months as part of the data team at The Marshall Project, a news organization covering the U.S. criminal justice system. Coming from an environment where we mainly relied on spreadsheets and off-the-shelf tools for visualization, I thought doing data journalism with Python full-time would be like using spreadsheets on steroids. I was wrong.
A journalist who codes can:
- Find not just one, but multiple stories, by setting up a database;
- Ensure that their process is easier to replicate and faster than manual work;
- Know which projects need code, a data pipeline, or even a team—and which ones will work just fine in a good old spreadsheet.
I learned that coding can change the very journalism we do, and here is how:
Two brains are better than one
Journalists don’t often share unfinished drafts when they’re writing, let alone watch each other write them in real time. With coding, this somehow turned out to be the most liberating practice. I would watch my colleagues search for solutions or adapt a copy-pasted Stack Overflow answer. Often they would fail and have to try again. It was like peeking behind the scenes of the magic show, which suddenly looks much less like magic and much more like a craft.
Googling for just the right code snippet, discussing why things don’t work as expected, or even just having another pair of eyes on the code turns out to be helpful. Writing code with others watching you might make you feel clumsy at first, but it gets more empowering as you go. For a data journalist who was just switching to code, this was an invaluable exercise.
For journalists who work as a single-person data desk, struggling with code on their own, finding that extra pair of eyes might involve reaching out to a friend or peer somewhere else. (Editor’s note: If you work solo, you can reach out to programs like Peer Data Review or find lots of people to connect with in journalism Slack communities like Lonely Coders’ Club, the Journalists of Color Slack, or News Nerdery.)
However it comes together, having a team—and even better, a data editor—can take your performance to another level. It lets you brainstorm methodologies and discuss alternative technologies; screen share when you are stuck; let people specialize in different directions, be it databases or design; refactor code so that it’s clean and smart; have somebody actually review your code; and let others pick up your code or re-use it in new projects.
This actually happened to me because my fellowship ended before the last story I was working on was published. Because there was code and a readme file, my colleagues could continue right from where I stopped (and I hope they did not have too much of a headache!).
Data projects can generate many stories at a time
Coding facilitates “generative journalism,” a concept pioneered at places like ProPublica, L.A. Times, and NPR.
Imagine a big, multi-faceted database that encompasses a system as a whole and allows one to run a practically infinite number of queries that test numerous hypotheses. This requires more effort at the start: scraping data from sometimes hard-to-scrape websites, working out a way to clean and join it together, and documenting the process. But once this is done, the database turns into a trove of stories.
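To make the idea concrete, here is a minimal sketch of one small database answering several questions. Everything in it is an assumption for illustration: the `cases` table, its columns, and the four made-up rows are hypothetical and are not taken from Testify or any real court data. It uses Python's built-in `sqlite3` module.

```python
import sqlite3

# Hypothetical, made-up records purely for illustration; not real court data.
ROWS = [
    ("2021-001", "Judge A", "felony", "dismissed"),
    ("2021-002", "Judge A", "felony", "convicted"),
    ("2021-003", "Judge B", "misdemeanor", "convicted"),
    ("2021-004", "Judge B", "felony", "convicted"),
]

# Columns we allow grouping by (hypothetical schema).
GROUPABLE = {"judge", "charge_level", "outcome"}


def build_database():
    """Load the records once into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cases ("
        "case_id TEXT PRIMARY KEY, judge TEXT, charge_level TEXT, outcome TEXT)"
    )
    conn.executemany("INSERT INTO cases VALUES (?, ?, ?, ?)", ROWS)
    conn.commit()
    return conn


def count_by(conn, column):
    """Count cases per value of one column; each call is a different angle."""
    # Column names cannot be bound as SQL parameters, so check a whitelist.
    if column not in GROUPABLE:
        raise ValueError(f"cannot group by {column!r}")
    query = f"SELECT {column}, COUNT(*) FROM cases GROUP BY {column} ORDER BY {column}"
    return conn.execute(query).fetchall()


conn = build_database()
outcomes = count_by(conn, "outcome")      # one possible story: how cases end
by_judge = count_by(conn, "judge")        # another: caseload per judge
by_charge = count_by(conn, "charge_level")
```

The upfront work is in `build_database`; after that, each new question costs one more query rather than a new round of data gathering.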
One example of such a trove would be Testify, a project spearheaded by my colleague Ilica Mahajan, where tens of thousands of court records are used to investigate the outcomes of the court system in Cuyahoga County, Ohio, from the racial demographics of defendants to voting for judges to the courts’ revolving door for defendants with prior charges. Or it could be releasing the data behind the story for exploration by other newsrooms, like these interactive tables that let you see if police in your state reported crime data to the FBI, a followup to a national story by Weihua Li.
In my time with The Marshall Project, we cleaned and categorized data about ARPA spending that helped us zoom out for a national overview and zoom in on local governments, for many kinds of geographic comparisons and thematic angles.
A processing pipeline speeds up your work
Another term used by the data team was “pipeline thinking,” preached by Marshall Project data editor David Eads. Journalists are used to “gathering string,” but working with data means learning to think like a coder: starting from the desired result and working backward. Pipeline thinking is designing your scripts from Z to A before working your way through them.
That’s where technologies that can speak to each other come in. At The Marshall Project, a commonly used combo is Python to load and prepare data, AWS to store it in the cloud using S3, and Observable notebooks to run analysis and sketch data visualizations. All of these are linked through a makefile into an automated pipeline. We also used a self-hosted API powered by Hasura (for interactive database queries), and Google Sheets or Airtable for hand-built databases.
Just like with generative journalism, an automated data pipeline involves investing in a project upfront. But as a reward, you are able to go faster and feel more confident at the end. Plus, if your dataset gets changed completely—which is not rare—you can re-run the whole analysis with one command.
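As a rough sketch of what re-running everything with one command can look like, the example below chains three small steps (load, clean, summarize) behind a single function. It is a pure-Python stand-in, not the makefile setup described above, and the survey text, column names, and function names are all hypothetical.

```python
import csv
import io
from collections import Counter


def load(raw_text):
    """Step 1: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def clean(rows):
    """Step 2: normalize whitespace and case, and drop blank answers."""
    cleaned = []
    for row in rows:
        answer = row["answer"].strip().lower()
        if answer:
            cleaned.append({"state": row["state"].strip().upper(), "answer": answer})
    return cleaned


def summarize(rows):
    """Step 3: the desired end result, a count of each answer."""
    return dict(Counter(row["answer"] for row in rows))


def run_pipeline(raw_text):
    """Run every step in order; this is the 'one command'."""
    return summarize(clean(load(raw_text)))


# Hypothetical survey exports: a messy first draft and a later, updated file.
FIRST_DRAFT = "state,answer\nny, Yes \noh,no\ntx,\n"
UPDATED = "state,answer\nny,yes\noh,yes\ntx,No\n"

first = run_pipeline(FIRST_DRAFT)
second = run_pipeline(UPDATED)  # same code, new data, no manual rework
```

Because each step only depends on the output of the one before it, swapping in a new file means calling `run_pipeline` again rather than repeating the cleaning by hand.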
For me, having a data pipeline was particularly helpful when we were visualizing a survey of U.S. sheriffs and could update the underlying data as many times as needed, playing with the format or wording, and still have dozens of visualizations in a matter of seconds.
Sometimes the right tool for the job is the simple one
Coding makes data journalism replicable, scalable, and innovative. That said, let’s not underestimate the value of spreadsheet software and off-the-shelf tools.
For some reason I thought that I’d be switching to a code environment for good, and I was wrong here, too. In fact, low-tech solutions are never off the table, for so many reasons: the need for manual input, editing, or vetting of the data; working with small datasets; using data visualization software for simple charts to save time for interactive customized products.
It is never about coding just for the sake of coding, after all. It is about coding that makes your journalism faster, bigger, and smarter.
Special thanks to David Eads for editing this piece.
Anastasia Valeeva is a data reporter with Newsday. Prior to that, she was a data fellow at The Marshall Project as part of the Alfred Friendly fellowship program run by Press Partners. She is also a co-founder of the School of Data Kyrgyzstan and an assistant professor at the American University of Central Asia, Kyrgyzstan.