Distrust Your Data

Jacob Harris on Six Ways to Make Mistakes with Data

With the launching of 538, Vox and the New York Times’ Upshot, it seems like the age of data journalism is finally here, greeted with both acclaim and concern by media critics. But data journalism is not a new thing. These new sites are just the latest iteration of news applications, which were an iteration of computer-assisted reporting, which was an iteration of precision journalism, all of which are just names for specific techniques and approaches used in the service of reporting the truth and finding the story. In other words, it’s journalism that starts from interrogating the data—and applies the same skepticism and rigor that we apply to the testimony of an expert contacted by traditional phone-assisted reporting.

All of which is to say that data journalism inherits a long tradition of journalists working with data, and that comes with the heavy responsibility to get it right. Specifically, to paraphrase something I heard at a NICAR conference once: fear and paranoia are the best friends a data journalist can have. I think about this often when I work with data, because I am terrified about making a dumb mistake. The public has only a limited tolerance for fast-and-loose data journalism and we can’t keep fucking it up.

Critique is always annoying when it’s expressed in indefinite terms. So, I’m going to do something I don’t normally like to do and pick a recent example of a data journalism story gone wrong. This is not to scold those who reported it—indeed, I’m well aware of how easy it is for me to make similar mistakes—but because a specific example provides an explicit illustration of how reporting on data can go wrong and what we can learn from it. And so, let’s begin by talking about porn.

Specifically, a story about online pornography consumption in “red” vs. “blue” states that exploded onto social media a few weeks back. I first noticed it because of a story on Vox that reaggregated an Andrew Sullivan post which in turn reposted a chart made by Christopher Ingraham of the data provided by Pornhub for their study. That chain of links reflects how news spreads online these days, and yet none of those professional eyes caught some glaring flaws in the data.

Before I continue, here’s a brief summary of the findings presented by Pornhub’s data scientists. Pornhub (which is apparently the third most-popular pornography site on the Internet) was approached by Buzzfeed (which is probably the most-popular animated GIF distributor on the Internet) to analyze its traffic and determine whether “blue” states that voted for Obama in the last election consumed more pornography than “red” states that voted for Romney. And so, that’s what the statisticians at Pornhub did, pulling IP addresses from their website’s traffic logs, geocoding their likely locations and deriving a figure of total traffic for each state. They then divided the total hits from each state by that state’s population to derive a hits-per-capita number for each state. As a result, they were able to report the per-capita averages for each state and that blue states averaged slightly more hits per capita than red states.

How To Confuse Yourself With Statistics

Unfortunately, the study and the subsequent reporting derived from the Pornhub data serve as a vivid example of six ways to make mistakes with statistics:

  • Sloppy proxies
  • Dichotomizing
  • Correlation does not equal causation
  • Ecological inference
  • Geocoding
  • Data naivete

The first issues begin with the selection of the proxy. In statistics, a proxy is a variable that is used when it’s impossible to measure something directly—for instance, using per-capita GDP as a measure of standard of living. Buzzfeed titled its article about the Pornhub study “Who Watches More Porn: Republicans Or Democrats?”. Let’s assume that’s the question that Buzzfeed wanted to ask. How would they do it? In an ideal world, they could ask every single Democrat and Republican in the country about their porn watching preferences, but this is obviously unfeasible. So, the next best thing after that would be to conduct a survey of a randomly selected group of individuals that shares similar characteristics to the national population. But that takes time and money and math, so instead Buzzfeed turned to their friends at Pornhub to derive an answer using the data they had on hand.

In this case, they used page requests to the third most-popular online porn site as a proxy for all pornography consumption and the percentage of the people who voted for Obama or Romney as proxies for registered Democrats and Republicans. These proxies are not the same thing, so distortion is inevitable. For instance, maybe in some states, people widely prefer to get their pornography via on-demand cable or a sketchy video store, so they would be undercounted in the Pornhub figures. Similarly, this study uses total pageviews as a proxy for site users; the two are not necessarily the same and it’s unclear if increased pageviews mean a corresponding linear increase in users. In addition, given that a large number of Americans identify themselves as independents, is it accurate to classify those voters as red or blue depending on a single election? Proxies give us a means to derive answers, but they may not always be appropriate for the questions being asked.

The problems continue from there. For their analysis, Pornhub sorted states into red and blue ones. This seems like it makes sense, but they’ve flattened a continuous variable (the percentage of the state population that voted for Obama) into a binary condition (Romney wins / Obama wins). It’s likely this dichotomizing had a palpable effect, since it makes a battleground state like Virginia seem closer to a Democratic stalwart like Vermont than its ideological “red state” neighbors in the South. Fortunately some statisticians identified and corrected for this issue, producing a more accurate scatter plot of the states vs their vote share for Obama. The result: a correlation in which per-capita porn consumption accounted for about 16% of the variance in states’ vote percentage for Obama. Success!
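To see what dichotomizing throws away, here is a minimal Python sketch. Every number in it is invented for illustration (these are not Pornhub's figures), and `pearson_r` is just a hand-rolled helper. It computes the correlation on the continuous vote share, then collapses the same states into red/blue labels.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical states: (Obama vote share in %, hits per capita). Invented values.
states = {
    "A": (38.0, 11.0),
    "B": (45.0, 14.0),
    "C": (49.0, 10.0),
    "D": (51.0, 12.0),
    "E": (58.0, 13.0),
    "F": (67.0, 15.0),
}

shares = [share for share, _ in states.values()]
hits = [h for _, h in states.values()]

r = pearson_r(shares, hits)
r_squared = r ** 2  # share of variance in one variable associated with the other

# Dichotomizing: collapse the continuous share into a red/blue label.
labels = {name: ("blue" if share > 50 else "red")
          for name, (share, _) in states.items()}

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
print(labels)
```

In the labels, state D (51%) and state F (67%) become indistinguishable, while C (49%) and D (51%) land on opposite sides even though they are only two points apart. An r² of 0.16 corresponds to r = 0.4, which is why the scatter plot tells you more than the two-bucket average.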

But wait. Here we stumble into two of the most classic mistakes people make with statistics. First, correlation does not equal causation. You’ve probably heard that a hundred times before, but this here is an actual illustration of why that matters. It’s entirely possible that the suggested relationship between the two variables is a total coincidence. Far more likely though is that the variables are related but only through a confounding variable that connects the two variables observed. For instance, blue states might have greater broadband penetration that would favor Internet porn. Or it could be that people in urban areas consume more Internet porn and states with more urban areas also trend Democratic. Confounding variables are common, and this piece by Jonathan Stray contains a solid overview of them and other spurious correlations. Or if you’d prefer a sarcastic look, here are correlations of voting to herpes infection or Nickelback listening. Putting it bluntly, these red state-blue state comparisons are statistical fluff, often reflecting the whimsy of the reporter more than anything real.

But what is the second mistake? For the sake of argument, let’s assume that we’ve avoided all these other problems above. Let’s decide Internet porn is a valid proxy for all pornography, that votes for a specific candidate in the last presidential election is a valid measure of party affiliation, that the correlation is not due to any hidden variables, then we can definitively say that Democrats consume more porn than Republicans, right? Wrong. Meet the ecological inference fallacy. In short, just because you’ve derived some average measure about a group that contains more of a subpopulation, that doesn’t necessarily mean it’s true for individuals in that group, especially when the difference is so slight. It’s possible that Democrats really do consume more porn and that’s what makes for the higher numbers per-capita in blue states. But it could also be that Republicans in Democrat-dominated states consume more porn than in Republican-dominated ones and that is what is pushing up the average. Or it could be that urban areas often consume more pornography and also tend to contain more Democrats but the two aren’t directly connected. We simply don’t have enough insight into the individual population to say.

And we definitely don’t have any insight into specific people based on these broad statistics. Knowing that your neighbor is a Republican or a Democrat tells you nothing about their porn consumption, regardless of the averages they derived for each population.
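A small worked example makes the ecological inference problem concrete. This is a sketch with invented numbers: two imaginary states, one 60% Democratic and one 40% Democratic, and two very different stories about who is doing the consuming. The function name and the rates are assumptions for illustration only.

```python
def state_average(dem_share, dem_rate, rep_rate):
    """Average per-person rate in a state made up of two groups."""
    return dem_share * dem_rate + (1 - dem_share) * rep_rate

# Invented numbers: the "blue" state is 60% Democratic, the "red" state 40%.
BLUE_DEM_SHARE, RED_DEM_SHARE = 0.6, 0.4

# Scenario 1: Democrats consume more than Republicans everywhere.
s1_blue = state_average(BLUE_DEM_SHARE, dem_rate=12.0, rep_rate=10.0)
s1_red = state_average(RED_DEM_SHARE, dem_rate=12.0, rep_rate=10.0)

# Scenario 2: Republicans consume more than Democrats everywhere,
# and Republicans living in the blue state consume the most of all.
s2_blue = state_average(BLUE_DEM_SHARE, dem_rate=9.0, rep_rate=14.5)
s2_red = state_average(RED_DEM_SHARE, dem_rate=9.0, rep_rate=12.0)

print(round(s1_blue, 2), round(s1_red, 2))
print(round(s2_blue, 2), round(s2_red, 2))
```

Both scenarios produce the same state-level averages (11.2 for the blue state, 10.8 for the red one), yet they say opposite things about individuals. State totals alone cannot tell them apart.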

We’re Not in Kansas Anymore

Unfortunately, the worst error was yet to come. A lot of the early reporting on this study noticed a bizarre anomaly in the data: Kansas, a very red state, consumed an extremely high amount of porn per capita compared to the average for all other states. This is readily apparent when the numbers are graphed in a simple bar chart, but it really jumps out when the states are plotted on a scatterplot of Obama vote share vs. page hits. If you assumed, as Pornhub did, that average porn consumption was normally distributed across all states, Kansas’ average was highly unlikely. At more than 2.95 standard deviations above the average, there would be a 0.16% chance of that occurring if it were truly random. An extreme outlier like this should make you sit up and take notice as a data journalist, because it can only mean one of two things. Either you’ve really found an extreme case that reveals something bizarre and newsworthy. Or—as one reader of Andrew Sullivan’s website figured out while all the journalists shrugged their shoulders—the data is flawed.
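The 0.16% figure can be checked in a few lines. This sketch assumes a one-sided tail of a standard normal distribution, which is the reading that matches the number quoted above; the function name is mine.

```python
from math import erfc, sqrt

def upper_tail_probability(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * erfc(z / sqrt(2))

p = upper_tail_probability(2.95)
print(f"{p:.2%}")  # roughly 0.16%
```

Any time a value sits this far out, the calculation is less interesting than the question it raises: is the state unusual, or is the number wrong?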

Pornhub’s writeup omitted any explicit description of their methodology—this is never a good sign—but it seems to have involved mapping the IP addresses from which users visited the site to physical addresses and reverse geocoding those to get states. The statisticians at Pornhub (and the journalists who confidently reported their findings) assumed this was a clean process, but any programmer with experience can tell you the bitter truth: geocoding is often rubbish. What happened here was that a large percentage of IP addresses could not be resolved to an address any more specific than “USA.” When that address was geocoded, it returned a point in the centroid of the continental United States, which placed it in the state of—you guessed it—Kansas! Sadly, IP geocoding is prone to other distortions from networking architecture; for instance, at one time every user of AOL’s nationwide dialup service looked like they were connecting to the Internet from Reston, Virginia. Right now, my corporate VPN makes me look like I’m surfing the web from New Jersey even though I live in Maryland.
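One cheap defense is to look for points stacked on the geocoder's country-level fallback before aggregating by state. The sketch below is only an illustration: the fallback coordinate is an assumed placeholder (the real value depends on the geocoding database you use, so check its documentation), and the function names and sample points are invented.

```python
# Assumed placeholder: the coordinate a geocoder returns when it only knows "USA".
# The real value depends on the geocoding database; check its documentation.
COUNTRY_FALLBACK = (38.0, -97.0)

def is_country_fallback(lat, lon, fallback=COUNTRY_FALLBACK, tolerance=1e-4):
    """True if a geocoded point sits on the country-level fallback coordinate."""
    return (abs(lat - fallback[0]) <= tolerance
            and abs(lon - fallback[1]) <= tolerance)

def split_by_precision(points):
    """Separate usable points from ones that only resolved to the country."""
    usable, fallback_only = [], []
    for lat, lon in points:
        if is_country_fallback(lat, lon):
            fallback_only.append((lat, lon))
        else:
            usable.append((lat, lon))
    return usable, fallback_only

# Invented sample: two plausible locations and two country-level fallbacks.
sample = [(38.9, -77.0), (38.0, -97.0), (40.7, -74.0), (38.0, -97.0)]
usable, fallback_only = split_by_precision(sample)
print(len(usable), "usable;", len(fallback_only), "resolved only to the country")
```

A simpler version of the same check is to count how many records share the exact same coordinate. Thousands of visitors at one latitude and longitude is a smell, wherever that point happens to be.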

Of course, if we shift Kansas’ average downwards, that doesn’t change Pornhub’s hypothesis that blue states consume more porn per capita than red states. I’ve already sufficiently argued my concerns with that, but I bring up this specific error because of the central failure it illuminated. If you want to call yourself a data journalist, there is one shortcut you can never take: you must validate your data. Even the cleanest looking data might contain flaws and omissions stemming from its methodology. It’s not enough to run checks on the data itself. You must also lift your nose out of the database, ask the serious questions about how the data was collected and even use the well-honed tools of a traditional reporter to call experts when—never an if—you find questions about the data.

Doing It Better

I know I promised I wouldn’t be a scold. But this is important. You might ask why I should care so much about a bit of viral silliness from Buzzfeed. First, I would argue it’s never just “all in fun” when you’re declaring half of the electorate more perverted than the other half. But more importantly, I don’t think the errors illustrated here are an aberration. Here’s another example of blindly trusting data to reach wrong conclusions. And another. By the hand-waving measures of traditional journalism, that’s three, making this a bona fide trend! I fear it will only get worse as publishing cycles become faster and the data analysis is done by single reporters harried by deadline pressure, with nobody to cross-check their work before publication. I don’t think we can slow this trend down, but what can data journalists do to avoid slamming into these sorts of problems at full speed?

Distrust the Data

First, remember that skepticism is your truest friend if you want to call yourself a journalist. It’s not hard to see the flaws in a flimsy study if you are predisposed to contemplate all the ways in which the data is probably bad rather than tacitly accepting it as good and tested just because someone else reported on it too. If you need further inspiration, I’d suggest looking at two excellent pieces from related fields on the value of skepticism. The first of these—On Being A Data Skeptic—is a free ebook from O’Reilly that describes a similar problem gripping data scientists: the belief that quantifying a model is the same as accurately describing it. It’s where I learned to think critically about proxies. The second of these—A Rough Guide To Spotting Bad Science—is an excellent run-down of all the bad ways statistics are applied in the worst scientific studies.

Distrust the Motives

As journalists, it’s not enough to be skeptical of the data; you also need to be wary of the agenda of whoever provided it. What angered me the most about this study is that it was clearly framed from the start to go viral. You’d have to be willfully naive about the motivations of Pornhub and Buzzfeed to assume they wanted anything else here. And yet many sites acted as willing accomplices for a porn site that certainly didn’t mind seeing its name printed far and wide on the web. We mock publications that uncritically republish press releases, but how was this any different? Data usually comes with an agenda; few people collect data for no reason. This doesn’t mean that you must avoid all data completely for fear of contamination. For instance, if you were reporting on water quality, it would make sense to partner with a nonprofit advocating on this issue if their data seems objective enough. Sources also have agendas after all, and they don’t prevent reporters from interviewing them. It would make less sense to uncritically use data freely provided from an industry you were reporting on. Most reporters can decide how much they want to trust their sources; similar reasoning might apply to data.

Sniff Out The Problems

There is a concept from programming I’d also like to see applied to data analysis. As programmers add features to a system, this means writing more code and adding complexity to the system. Both of these usually mean that more bugs are added as well. Refactoring is the name for a toolkit of approaches to clean up ugly code and reverse the bloat added to programs over time. Simply put, it’s a listing of bad practices you might observe in code with suggested remedies on how to fix them. These have been called “code smells” because to an experienced coder, recognizing these problems becomes as innate as smelling something that has gone moldy in the fridge. Similarly, everyone who reports on data can name a few of their favorite “data smells”—e.g., Benford’s Law, large standard deviations, double-counted or omitted records, category fields that are manually entered—but there is no central repository for this information.
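Two of those smells are easy to sketch in code. What follows is an illustration and not a vetted toolkit: the function names are mine and the sample figures are invented. The first check compares leading digits against Benford's Law, where digit d is expected with probability log10(1 + 1/d); the second looks for records that appear more than once.

```python
from collections import Counter
from math import log10

def benford_expected(digit):
    """Expected share of values with this leading digit under Benford's Law."""
    return log10(1 + 1 / digit)

def leading_digit(value):
    """First non-zero digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. '1.23...e-03'.
    return int(f"{abs(value):.15e}"[0])

def leading_digit_shares(values):
    """Observed share of each leading digit 1-9 among the non-zero values."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        return {}
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

def duplicated_records(records):
    """Records that appear more than once (possible double counting)."""
    counts = Counter(records)
    return [rec for rec, n in counts.items() if n > 1]

# Invented figures, purely to exercise the functions.
amounts = [1200, 1350, 180, 19, 2400, 310, 45, 1100, 760, 13]
observed = leading_digit_shares(amounts)
for d in range(1, 10):
    print(d, f"observed {observed[d]:.2f}", f"expected {benford_expected(d):.2f}")

print(duplicated_records([("KS", 2012), ("VT", 2012), ("KS", 2012)]))
```

A mismatch with Benford's Law is a reason to ask questions, not proof of anything. The law only tends to hold for figures that span several orders of magnitude, and ten values is far too few to judge.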

Learn Statistics

I know it sounds terrifying, but I’d also recommend learning statistics. I don’t know why I didn’t take that step in college, but I’m glad to have the option of learning with a MOOC now. Both Coursera and EdX seem to have great options. Learn statistics if you can. I don’t mean you need to learn about advanced topics like ANOVA or Monte Carlo simulations, but no journalist should report on data if they don’t understand the difference between a mean and median and what common measures of variance and spread are. If that still is too terrifying to contemplate, at least learn to think like a statistician and see how it changes your attitudes towards data.
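If the difference between a mean and a median feels abstract, a few lines of Python with the standard `statistics` module show why it matters. The values are invented: five per-capita figures, one of which is wildly out of line.

```python
from statistics import mean, median, pstdev

# Invented per-capita figures for five places; the last one is an outlier.
values = [10, 11, 12, 13, 54]

print(mean(values))    # 20
print(median(values))  # 12
print(pstdev(values))  # spread (population standard deviation)

# Drop the outlier and the two measures of the center agree again.
trimmed = values[:-1]
print(mean(trimmed), median(trimmed))  # 11.5 11.5
```

One bad record drags the mean from 11.5 to 20 while the median barely moves. When the two disagree this much, go look at the rows responsible before you write a headline.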

Look Back to Go Forward

Ultimately I suspect that many mistake-riddled pieces of data journalism run aground in the same shallow seas: things like shoddy data, misapplied proxies, and botched statistics. But I actually don’t have any data to back up that suspicion. Greg Linch makes the important point that we should do the unpleasant job of cataloguing where the process went wrong in pieces of bad data journalism. Post-mortems are a common practice in computer programming to identify ways in which the best-laid plans go awry. That approach gives organizations insight into their own particular programming mistakes; maybe it would work for data too? As practitioners, we could start assembling a comprehensive list of data smells (specific common problems) and gradually create a checklist of high-level classes of errors as a resource for data journalists and their editors.
