Black Box Be Gone: Tools for Human-Optimized Data Analysis
Suggestions for your team, plus an intro to our toolkit at DataMade
Transparency is a basic tenet of journalism, yet it remains a challenge in journalistic data analysis. Reproducible data workflows strive to address this problem by making transformations and calculations replicable with a single command. A growing number of newsroom developers are adopting this practice, in part because scripted data work produces an audit trail for readers, provided those readers can code.
Literate data analysis is reproducible data work that’s meant to be shared. A literate approach combines the convenience of reproducible data work with the broader transparency of plain language, which makes methodology clear to any reader, whether or not they understand the code.
More simply, a literate approach to data analysis is a democratic approach to data analysis.
The principles of literate analysis date back to a 1984 article in which computer scientist Donald Knuth invited programmers to combine source code with documentation in service of human comprehension: “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
Knuth’s standard called for prose of a quality on par with literature, hence its name, “literate programming.”
Choosing a Toolkit: In Theory
Many toolkits can facilitate literate data analysis, but they all share a few essential traits:
Code and prose live together in a single input file.
The input file facilitates collaboration.
The input file produces the output you need.
You might notice I haven’t mentioned a language. At DataMade, a civic technology company in Chicago, we like Python for data analysis because it’s famously easy to read. However, Python itself is not essential to a literate approach.
Let’s unpack the essentials of a literate toolkit. For fellow Pythonistas, we’ll also assess two popular Python toolkits for literate data analysis: Jupyter, a popular newsroom option, and Pweave, our framework of choice at DataMade.
Code and Prose Live Together in a Single Input File
Literate analysis files are easy to read because you organize them according to human logic rather than computer logic. That means you can pair plain-language assertions with supporting data and charts, or document mathematical operations right next to the code that performs them.
Jupyter notebooks provide a browser-based interactive development environment for mixing code with markdown and plain text, making them easy to hack on, and they’re rendered nicely in GitHub, making them easy to share.
For R aficionados, Pweave cribs heavily from Sweave and knitr in that it exports combined code and prose inputs into multiple, flexible output formats. In contrast to Jupyter, you’re free to compose Pweave input files in the development environment of your choice.
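To make the pairing of prose and code concrete, here is a minimal sketch of a literate Python file. It assumes Pweave's script-publishing convention, in which comment lines beginning with `#'` are treated as prose; check the Pweave documentation before relying on that marker. The commute figures and variable names are hypothetical and exist only to show the layout.

```python
#' Average commute times
#'
#' We claim the typical commute in our sample is about 30 minutes.
#' The figures below are made up for illustration.

from statistics import mean, median

# Hypothetical commute times, in minutes
commute_minutes = [22, 35, 28, 41, 30, 24]

#' The mean is sensitive to outliers, so we report the median alongside it.

mean_commute = mean(commute_minutes)
median_commute = median(commute_minutes)

print(f"Mean commute: {mean_commute:.1f} minutes")
print(f"Median commute: {median_commute:.1f} minutes")
```

Because the prose lines are ordinary comments, the file also runs as a plain Python script, and each assertion sits directly above the calculation that supports it.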
The Input File Facilitates Collaboration
A literate approach should make working together easy, whether you’re collaborating with an editor or laying the groundwork for others to extend your finished analysis.
To review each other’s work, we use git diffs, or a side-by-side comparison of changed files to their previous versions. Jupyter notebooks look friendly in GitHub’s file view, but the underlying JSON blob isn’t so pretty when diffed. An ugly diff makes it hard to tell what has changed from one version of a file to the next. By contrast, raw Pweave input files look like plain text, so it’s easy for us to track changes, and catch mistakes, as we conduct analyses.
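The difference in diff noise can be sketched with Python's standard library alone. The example below is an illustration under stated assumptions, not a measurement of real files: `notebook_json` is a hypothetical helper that builds a stripped-down, one-cell document loosely following the notebook file layout, which stores outputs and execution counts alongside the source. The same one-line code change is applied to a plain-text file and to the notebook-style JSON, and the changed lines are counted in each.

```python
import difflib
import json


def changed_lines(before, after):
    """Count the added and removed lines between two multi-line strings."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def notebook_json(source_lines, output, execution_count):
    """Serialize a simplified one-cell notebook (hypothetical helper)."""
    cell = {
        "cell_type": "code",
        "execution_count": execution_count,
        "metadata": {},
        "outputs": [
            {"name": "stdout", "output_type": "stream", "text": [output]}
        ],
        "source": source_lines,
    }
    notebook = {
        "cells": [cell],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    return json.dumps(notebook, indent=1, sort_keys=True)


# The same edit, made to a plain-text input file...
plain_before = "The total is computed below.\n\ntotal = sum([1, 2, 3])\nprint(total)\n"
plain_after = "The total is computed below.\n\ntotal = sum([1, 2, 3, 4])\nprint(total)\n"

# ...and to a notebook, where re-running the cell also rewrites
# the stored output and the execution counter.
nb_before = notebook_json(["total = sum([1, 2, 3])\n", "print(total)"], "6\n", 1)
nb_after = notebook_json(["total = sum([1, 2, 3, 4])\n", "print(total)"], "10\n", 2)

plain_changes = changed_lines(plain_before, plain_after)
notebook_changes = changed_lines(nb_before, nb_after)

print(f"Plain text: {plain_changes} changed lines")
print(f"Notebook JSON: {notebook_changes} changed lines")
```

In this toy case the notebook diff is larger because the saved output and execution count change along with the code. Real notebooks with charts or large tables embed far more output, so the gap tends to grow.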
If you are less concerned with version control and more concerned with collaborative editing, a new project called Colaboratory allows multiple users to work together in the same Jupyter environment.
The Input File Produces the Output You Need
Literate programming was conceived as a way to produce both code and documentation from a single, logically organized input file. Similarly, you should be able to produce the outputs you need using the input your toolkit requires.
Jupyter is opinionated about the look and feel of output you can derive from it, making it less than ideal if you require more than the standard text and code presentation. By contrast, Pweave is built to export a single input file to whichever of its output formats you need.
Imagine writing a story interspersed with supporting code chunks in a Pweave input file. Just need the text for your content-management system? You can configure Pweave to omit the code. Need a methods sidebar? Convert documentation from the input file into HTML (or markdown, or LaTeX…).
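The idea of deriving several outputs from one input can be illustrated with a toy splitter. This is not Pweave's implementation; it is a minimal sketch that assumes noweb-style chunk markers, where a line such as `<<>>=` opens a code chunk and a line containing only `@` closes it. The function names and the ridership figures are hypothetical.

```python
import re

# Assumed noweb-style markers: "<<options>>=" opens a code chunk, "@" closes it.
CHUNK_START = re.compile(r"^<<(.*)>>=\s*$")
CHUNK_END = "@"


def parse(source):
    """Split a literate document into ("doc" or "code", text) pairs."""
    chunks = []
    kind, lines = "doc", []
    for line in source.splitlines():
        if kind == "doc" and CHUNK_START.match(line):
            if lines:
                chunks.append(("doc", "\n".join(lines)))
            kind, lines = "code", []
        elif kind == "code" and line.strip() == CHUNK_END:
            chunks.append(("code", "\n".join(lines)))
            kind, lines = "doc", []
        else:
            lines.append(line)
    if lines:
        chunks.append((kind, "\n".join(lines)))
    return chunks


def prose_only(source):
    """Keep the story text and drop the code, as for a CMS."""
    return "\n".join(text for kind, text in parse(source) if kind == "doc")


def code_only(source):
    """Keep the code and drop the prose, as for a runnable script."""
    return "\n".join(text for kind, text in parse(source) if kind == "code")


SAMPLE = """\
Ridership rose over the three months we examined.
<<>>=
riders = [120, 135, 150]
change = riders[-1] - riders[0]
@
The gain was 30 riders, based on hypothetical figures.
"""

print(prose_only(SAMPLE))
print(code_only(SAMPLE))
```

Both outputs come from the same file, so the published text and the code that backs it cannot drift apart without the change showing up in one place.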
Choosing a Toolkit: In Practice
As a small company that serves a diverse audience, we invested in literate data analysis because it ensures that we can share our work with clients, the public, and fellow data wranglers alike. Here’s how we went about selecting our toolkit.
In addition to the general requirements for literate analysis toolkits, we identified specific requirements for a literate analysis toolkit that would work for us: The framework must be Python friendly, the input must play well with version control and compile to LaTeX, and the output must be generated from the command line.
While Jupyter has clear merits, it didn’t fit our versioning workflow, so we went with Pweave. It soon became clear that Pweave sorely lacked Jupyter’s interactivity. Tired of compiling and recompiling the same output document just to fiddle with chart styles, but unwilling to give up on Pweave, we started looking for a solution.
We searched, emailed, and seriously considered writing a Python equivalent to R-Box, a Sublime Text extension for interactive development in R, before we found instructions for replicating Jupyter-style interactivity with Pweave input files. We’ve since extended the instructions to include Jupyter’s beloved shift + enter keyboard shortcut for running code cells.
Want to give our Pweave-centric toolkit a spin? Let us save you the time we spent working out the kinks. We’re compiling installation instructions, environment setup, tutorials, and examples in a public GitHub repository.
Hannah Cushman is a journalist turned software developer. She believes in open information and empathy, and pursues both at DataMade, a civic technology company in Chicago.