Introducing agate: a Better Data Analysis Library for Journalists
agate, a Python library optimized for humans, reaches 1.0
It may sound obvious, but the primary role of a data journalist is to analyze data. Whether the analysis is simple or complex, it is our capacity to do this well that makes us valuable in the newsroom. And yet, most of us are still doing it using old-fashioned, error-prone methods. Specifically, we use processes that are:
- inscrutable (R, numpy);
- difficult to replicate (Excel, Google Docs, OpenRefine); and
- wedded to tools that are complex or expensive (SPSS, SAS, ArcGIS).
As journalists, we not only need to solve these problems for practical reporting purposes, but also for philosophical ones. How can we assert that our numbers are correct if we performed a series of manual processes in a spreadsheet exactly once? Do it that way and the only record of how it was done is the one in your head. That’s not good enough. Journalistic integrity requires that we’re able to document and explain our processes.
For the last year or so, I’ve challenged myself to design a better way of doing routine data analysis. It’s a problem with several parts, and today I’m thrilled to announce that the first and largest piece of the solution has reached version 1.0. It’s called agate, and it’s going to make your process better. I’ve been using agate in production for two months, and I will personally guarantee that it works.
In greater depth, agate is a Python data analysis library in the vein of numpy or pandas, but with one crucial difference. Whereas those libraries optimize for the needs of scientists—namely, being incredibly fast when working with vast numerical datasets—agate instead optimizes for the performance of the human who is using it. That means stripping out those technical optimizations and instead focusing on designing code that is easy to learn, readable, and flexible enough to handle any weird data you throw at it.
(Love csvkit? agate is all the guts of csvkit, converted to a Python library and amped up a hundred times. Sorry, Ruby programmers—maybe you can steal some of the ideas for your own projects!)
Does focusing on human performance mean agate is slow? No: computers are very fast. Except in cases where the amount of data is truly huge (scientific research, financial systems), the optimizations that make these libraries complex, such as writing large parts of them in C, are unnecessary. They also make the libraries less flexible and more difficult to use. agate does away with them to provide a simple, readable, pure-Python solution for the sorts of data analysis journalists (and many others) are doing every day.
You should be skeptical of what I’m saying right now. So I’m going to offer some strong anecdotal evidence, and then you can report it out.
Here’s an analysis written with agate:
```python
import agate

purchases = agate.Table.from_csv('examples/realdata/ks_1033_data.csv')

by_county = purchases.group_by('county')

totals = by_county.aggregate([
    ('county_cost', agate.Sum('total_cost'))
])

totals = totals.order_by('county_cost', reverse=True)
totals.limit(10).print_bars('county', 'county_cost', width=80)
```
(Above example updated slightly on 11/5/15 to reflect changes made for agate version 1.1. —ed)
This code has no comments and works with a dataset you probably haven’t used before. Yet, I’m almost certain that you can tell exactly what it does. (Here is a version of the same analysis, with explanation, as a Jupyter notebook.)
Here is the output you would see in the console if you ran this code:
```
county      county_cost
SEDGWICK     977,174.45 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
COFFEY       691,749.03 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
MONTGOMERY   447,581.20 ▓░░░░░░░░░░░░░░░░░░░░░░░░░
JOHNSON      420,628.00 ▓░░░░░░░░░░░░░░░░░░░░░░░░
SALINE       245,450.24 ▓░░░░░░░░░░░░░░
FINNEY       171,862.20 ▓░░░░░░░░░░
BROWN        145,254.96 ▓░░░░░░░░
KIOWA         97,974.00 ▓░░░░░
WILSON        74,747.10 ▓░░░░
FORD          70,780.00 ▓░░░░
                        +-------------+-------------+-------------+-------------+
                        0          250,000       500,000       750,000  1,000,000
```
“Well sure,” I hear you say, “but I already know [SQL|R|SAS]. I’ve gotta learn how to do everything all over again!” In designing agate, I’ve been careful to keep its interface as consistent and obvious as possible. Many elements, such as columns and rows, share a base implementation, so once you’ve learned to use one you’ll know how to use all the others. I’ve also borrowed terminology from other common data analysis tools in order to speed your transition. The core table processing functions, for instance, use the names of their SQL analogues: `order_by` and so on. Learning to use agate will take effort, but if you’ve done any data analysis before you should find much of it familiar.
A clean interface isn’t going to help you very much if you don’t know where to start. That’s why agate also comes with a detailed user tutorial that teaches how to use agate by working through an analysis of a real-world dataset. It assumes you know how to write some Python code, but nothing else. Beyond that introduction, there is also an exhaustive cookbook, which includes dozens of recipes showing you how to perform common tasks (sorting, searching, ranking) and how to convert code from other languages into agate code (SQL, Excel, R).
Why Use It?
Ultimately, agate is a solution to both practical and philosophical problems. If you take the time to learn it, it will make your data analysis process faster (in human terms), easier to understand, and simpler to replicate. It’ll also mean that when your editor says, “How did you get to this result?” you’ll be able to toss them a Python script that shows exactly how you reached that conclusion. Your editor doesn’t code? Then it’s even more important, because it’ll be future you whose neck is on the line when somebody calls your numbers garbage. You can find more reasons why you should use agate in the “Why agate?” section of the documentation.
It’s probably obvious by now that I want you to start using agate. Today. Use it for the reporting project you’ve got on your desk. Build data workflows on top of it. Write extensions that integrate with your data warehouse or add custom features to it. File tickets. Send me pull requests. Just by using it you’ll be helping me to build a tool I truly believe that we all need.
(Sold on agate? Then you’re going to love the solution to the second part of the problem. It’s called proof. More on that to come.)