The job of a data journalist is to turn data into a story. If you start with a spreadsheet of cancer rates, the story might be “people living near oil refineries had three times the rate of lung cancer.” Or it might not be, because you could be misinterpreting the data in some way. This recorded talk is about how not to get fooled when you go looking for stories in your data.
This lecture was given as part of the 15th Annual Science Immersion Workshop for Journalists at the Metcalf Institute for Marine & Environmental Reporting, Rhode Island. The slides are available, as is the GitHub repo with all the R code needed to reproduce the examples in the talk.
A data journalism story is usually about some sort of pattern in the data. Think of headlines like “crime rates fall,” “humans are causing climate change,” or “countries with more guns have more deaths by firearms.” What exactly are these headlines claiming, and are these stories true?
Data doesn’t speak for itself, or the data journalist would not be needed. Instead it must be interpreted. This is the process of selecting and obtaining the relevant data, finding the interesting facts or patterns, putting them in context, and explaining what they mean. But there are many ways this process can go wrong and, sorry to say, professional journalists still regularly produce erroneous stories.
There are many reasons that you might accidentally misinterpret your data. You could choose the wrong data to answer your question, or you might not really understand how the data was collected and what its limitations are. You could believe you see a pattern that is really just a coincidence: something that is so likely to turn up by chance that it would be misleading to present it as fact. Many data stories also state or imply a causal relationship between two variables, but cause is a tricky thing and easily misunderstood. Or, you might analyze a small amount of data and incorrectly assume that the result generalizes to all cases.
Here are some basic questions you can ask to make sure that any sort of data has been interpreted correctly.
How Was the Data Collected?
Data doesn’t just come from thin air. It’s collected by specific people—or machines—for a specific purpose. There may also be people who have a financial or political interest in the numbers. For example, a police department wants to see crime statistics go down and this may affect how crimes are recorded. You must understand the data generation process, and the types of errors it’s likely to introduce. Many data journalists call this process “interviewing the data.” Here are some questions you can ask:
- Where do these numbers come from?
- Who recorded them?
- For what purpose was this data collected?
- How do we know it is complete?
- What are the demographics?
- Is this the right way to quantify this issue?
- Who is not included in these figures?
- Who is going to look bad or lose money as a result of these numbers?
- Is the data consistent from day to day, or when collected by different people?
- What arbitrary choices had to be made to generate the data?
- Is the data consistent with other sources? Who has already analyzed it?
- Does it have known flaws? Are there multiple versions?
For an excellent example of the difficulties in understanding what a data set actually records, see Matt Waite’s adventures handling data on race and ethnicity.
Is the Pattern Statistically Significant?
This is all about chance, specifically the chance that the pattern you see is just coincidental. The more likely it is that whatever you’re seeing might happen for entirely unrelated reasons, the less likely it is that you have a real story.
Do you know what pure randomness looks like? Truly random data—such as numbers generated by rolling a die—is a lot more likely to have interesting patterns in it than most people assume, and the talk includes several examples of this to try to calibrate your sense of what randomness looks like. Knowing this, it’s very important to ask what the odds are that the pattern you see is just coincidence. For a statistician or a data journalist, “what are the odds?” is not a rhetorical question but requires a quantitative answer.
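To get a feel for how often chance alone produces something that looks like a pattern, here is a minimal sketch in Python (the talk's own examples are in R, and this is not taken from them). It uses fair coin flips as a stand-in for any random process and estimates how often 100 flips contain a streak of at least six identical results in a row. The function names, the streak length, and the number of trials are all assumptions chosen for illustration.

```python
import random


def longest_run(values):
    """Return the length of the longest streak of identical consecutive values."""
    if not values:
        return 0
    best = current = 1
    for prev, cur in zip(values, values[1:]):
        current = current + 1 if cur == prev else 1
        best = max(best, current)
    return best


def share_with_long_run(n_flips=100, min_run=6, trials=10_000, seed=0):
    """Estimate how often n_flips fair coin flips contain a streak of
    at least min_run identical results (heads or tails)."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    hits = 0
    for _ in range(trials):
        flips = [rng.randint(0, 1) for _ in range(n_flips)]
        if longest_run(flips) >= min_run:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"Share of sequences with a streak of 6+: {share_with_long_run():.2f}")
```

Streaks like this turn up in most simulated sequences, which is why a run of similar values in real data is weak evidence on its own.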
Statistical testing is the process of figuring out how likely it is that what you’re seeing in the data happened by chance. Some people find this process scary, because it involves math. The majority of the talk is about statistical testing, but I’ve taken a somewhat different approach than you will find in most textbooks. It turns out that instead of doing statistics with equations, you can mostly do it with small amounts of code. I show several examples in the talk, and here are some detailed references explaining the process:
- Statistical Modeling: A Fresh Approach. This is by far the best introductory textbook on statistics that I know of, because it takes a modern computer and data-driven approach and clearly explains the underlying logic. The first five chapters are available free, and will take you up to computing confidence intervals in R, which is sufficient for many different kinds of statistical problems.
- Permutation methods: a basis for exact inference. This is a short description of some very simple methods for workhorse statistical testing, such as determining if the difference in test scores between two different schools is statistically significant. The material is somewhat dense, but the core techniques can be implemented in a few lines of code.
- Graphical inference for infovis. This is an extension of the logic of permutation methods to data visualization. It’s a wonderful technique because it applies to just about any type of data visualization that you could possibly dream up. Every data journalist should be familiar with this.
- The introductory statistics course: a Ptolemaic curriculum. This covers the history of permutation and randomization testing and why these methods didn’t make it into textbooks until recently. This isn’t how statistics is normally taught, but these methods are perfectly valid and in many ways conceptually simpler than z-scores, t-tests, and all that stuff.
I am indebted to Mark Hansen and Hadley Wickham for pointing me in this direction.
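The permutation approach described in the references above really can fit in a few lines of code. The following is a rough sketch in Python rather than the R used in the talk's repo, and it is one common way to write such a test, not necessarily the version shown in the talk. It asks whether the difference in average test scores between two schools is larger than what shuffling the school labels at random would produce. The scores are made up for illustration, and `permutation_test` is a hypothetical helper name.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the estimated probability of seeing a difference at least as
    large as the observed one if group labels were assigned at random.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Adding 1 to both counts treats the observed data as one of the
    # permutations, so the estimate can never be exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical test scores, invented for this example.
    school_a = [78, 85, 69, 91, 74, 88, 80, 77]
    school_b = [72, 65, 80, 70, 68, 75, 62, 71]
    print(f"p-value: {permutation_test(school_a, school_b):.3f}")
```

A small p-value means random relabeling rarely produces a gap as large as the one observed, so coincidence is an unlikely explanation. It says nothing yet about why the gap exists.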
Do You Have the Causality Right?
When we say something like “cancer rates are higher near the oil refineries” what we usually mean is “cancer rates are higher because of the oil refineries.” But, as the old saying goes, correlation is not causation. For our purposes, “correlation” just means a pattern in the data, exactly the sort of pattern you’re looking for when you do visualizations. But after you find a correlation between A and B, you still have to prove that A causes B.
There are only a few ways that two variables can end up correlated. Suppose we discover that countries with more guns have more gun homicides. It could be the case that:
- Gun ownership causes homicides (people will use guns if they have them)
- Homicides cause gun ownership (people will buy guns if they live somewhere dangerous)
- Something else causes both homicides and gun ownership (perhaps poverty)
- It’s just a coincidence (use statistical testing to rule out this possibility)
The easiest way to prove that the causation runs the way you think it does is to rule out all other possibilities. First you should rule out coincidence, which is what statistical testing is for. If your pattern survives the statistical test then there is most likely some real causation somewhere, but you still need to figure out which way.
Sometimes this is simple, as when there is an element of time involved. For example, including the word “cute” in an online dating message can cause someone to reply, but a reply can’t cause you to write the word “cute” because the reply happens after you write your first message.
In other cases, as with the guns example, it can be very difficult to nail down the causal structure. Be particularly alert for a hidden factor that affects both of your variables, which is called a confounding variable.
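A small simulation can show how a confounding variable manufactures a correlation. This is a sketch in Python built entirely on synthetic numbers, not on any real data about guns, homicides, or poverty. A hidden factor drives two variables, `a` and `b`, that have no direct effect on each other. The names `simulate`, `pearson`, and `overall_and_within_band`, and the choice of noise levels, are assumptions made for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, seed=0):
    """Generate (hidden, a, b) rows where hidden influences both a and b,
    but a and b never influence each other."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hidden = rng.gauss(0, 1)      # the confounding variable
        a = hidden + rng.gauss(0, 1)  # depends only on hidden plus noise
        b = hidden + rng.gauss(0, 1)  # depends only on hidden plus noise
        rows.append((hidden, a, b))
    return rows


def overall_and_within_band(rows, band=0.1):
    """Correlation of a and b over all rows, and over only those rows
    where the hidden factor is held nearly constant (|hidden| < band)."""
    overall = pearson([r[1] for r in rows], [r[2] for r in rows])
    near = [r for r in rows if abs(r[0]) < band]
    within = pearson([r[1] for r in near], [r[2] for r in near])
    return overall, within


if __name__ == "__main__":
    overall, within = overall_and_within_band(simulate())
    print(f"Correlation across all rows: {overall:.2f}")
    print(f"Correlation with hidden factor held fixed: {within:.2f}")
```

Across all rows the two variables are clearly correlated, yet the correlation largely disappears once the hidden factor is held roughly constant. Comparing like with like in this way is one check for a suspected confounder, provided you have measured it.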
Do Your Results Generalize?
Often when you’re reporting a story, you let a small number of cases stand in for a much larger number. For example, you might interview five students about their experiences with student debt and try to draw conclusions about all students with debt. Or you might analyze the data from one school and want to make statements about every school in the state.
Sometimes the generalization is not explicit but the audience will make a generalization anyway if you’re not very clear about the limitations of your analysis. For example, Americans think violent crime is increasing when it has actually been decreasing for two decades, possibly because the media only reports the most violent crimes. Implicit generalizations can also reinforce stereotypes or race, income, and gender inequalities. You might use Twitter data for a visualization, but that visualization really only tells us things about Twitter users, who tend to be young, middle-class, and male. This means your visualization really doesn’t say much about everyone else, but the audience probably isn’t thinking about that as they look at your pretty pictures.
Generalization of any sort is a dicey proposition, but it’s possible in certain circumstances. For example, opinion polls ask the questions of about a thousand people and generalize the answers to an entire country. This works because of very careful sampling strategies, and the price you pay is a margin of error which, together with a confidence level, tells you how often you can expect the generalization to be wrong, and by how much. This talk doesn’t go into the details of polling and other generalization methods; the point here is just to make sure you ask questions like:
- Am I claiming or implying that my results generalize?
- If so, how do I know that they do?
- Might the audience assume that my results generalize?
- If so, how do I make sure that the audience is left with an accurate impression?
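For the opinion poll example above, the margin of error can be made concrete. The sketch below, in Python, uses the standard textbook formula for a proportion, z * sqrt(p * (1 - p) / n), and assumes a simple random sample, which real polls only approximate through careful sampling and weighting. The simulation then checks how often polls of 1,000 people land within that margin of a known true value. The function names and the true share of 0.5 are assumptions for illustration.

```python
import math
import random


def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a proportion from a simple random sample of size n.
    z=1.96 corresponds to roughly 95% confidence; p=0.5 is the worst case."""
    return z * math.sqrt(p * (1 - p) / n)


def share_of_polls_within_margin(true_share=0.5, n=1000, trials=1000, seed=0):
    """Simulate many polls of n people and return the fraction whose
    result falls within the margin of error of the true share."""
    rng = random.Random(seed)
    moe = margin_of_error(n, true_share)
    within = 0
    for _ in range(trials):
        yes = sum(rng.random() < true_share for _ in range(n))
        if abs(yes / n - true_share) <= moe:
            within += 1
    return within / trials


if __name__ == "__main__":
    print(f"Margin of error for n=1000: {margin_of_error(1000):.3f}")
    print(f"Share of simulated polls within it: {share_of_polls_within_margin():.2f}")
```

For a sample of 1,000 the formula gives about 3.1 percentage points at 95% confidence. The margin tells you by how much a poll is likely to be off, and the confidence level tells you how often it will be off by more than that.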
These four questions are hardly the last word in data interpretation, but they are powerful tools. All of the fundamentals are here. In fact, permutation tests and graphical inference can be used for quite complex analyses. But you can learn deeper details and other methods as you need them. Far more important than any technical knowledge is an intuition about what questions to ask — and the discipline to ensure that someone has asked those questions before the story is published.
You should be able to apply these questions to your own work, the work of your colleagues, and when evaluating the work of others such as scientists and experts. I see mistaken data stories every day; hopefully, yours will be better.
A previous version of this article misidentified an article of Matt Waite’s as an article by Derek Willis, a very different though equally amazing author. We regret the error.
Co-founder & Advisor of Workbench. Jonathan is currently working as a research scholar at Columbia Journalism School. He has written for the New York Times, Associated Press, Foreign Policy, ProPublica, and Wired.