SRCCON: How Not to Skew Data with Statistics
A lively discussion on tricks for avoiding error
Knight-Mozilla Fellow Aurelia Moser and KPCC’s Chris Keller planned a SRCCON session to identify and explore things you can do in advance to avoid skewing your data, and to help improve your understanding of what gets done on the math side or the design side that affects the data and its interpretation.
- Can we draft a checklist for what you should be looking at with a data set before you start?
- What red flags should you be looking for when you do things based on a particular kind of design?
- Are there quick tricks people use for figuring out immediately if there are issues with a spreadsheet you get?
Here are some of the most useful approaches and suggestions that emerged from the discussion.
Finding the Nerds in the Basement
Journalist AmyJo Brown suggested a starting place: “If we’re building a checklist, my starting point is to interview the data source. What decisions were made and how was it collected? That’s reporting that has to occur before you start.” She works with campaign finance filings that aren’t consistent, so she’s constantly going back to the elections division to ask, “Tell me about this report, what is your interpretation of how this got filed?” She asks several different people the same questions, then calls candidates to ask how they interpret the law.
ProPublica’s Jeff Larson recommended that data journalists “Try to find that nerd in the basement.” For their unemployment insurance tracker, ProPublica came up with a simple model, then found the nerd in the basement and ran it past him as part of their reporting process: Do we have the totals right? The right columns?
Data scientist Laurie Skelly advocated “a healthy distrust of authority figures or people who are smiling. Find the disgruntled person, you’ll find the secrets.” She pointed out that in her company they spend a lot of time disambiguating two people’s idea of what’s going on. She said one of the most valuable tools they have is sketching. “In words you think you agree, but you draw it and realize you were on completely different planets. Drawing solves so many problems. It points out what doesn’t make sense before you’ve spent any time writing code.” She emphasized using low-tech methods, and pulling people in early.
Sara Schnadt of Census Reporter and Open Elections was the nerd in the basement under Chicago’s Mayor Daley, though she wasn’t in an actual basement. She said most of the people in city government in Chicago were in an institutional bubble; they had been there for their whole careers and didn’t intend to leave. They had no way of bridging their reality with the rest of the world. “If you’re going to talk to someone at a government entity, at least identify one person who is a cultural bridge, who can contextualize the workflow.” And if you can’t find a cultural bridge in the organization, Schnadt said, then start taking things with a grain of salt and getting three or more vantage points. If you understand the workflow, you have better instincts about what might be internal and quirky. But even being able to understand the workflow might be a big issue.
Keller affirmed that it’s helpful to walk through a flowchart or the workflow for how a form is processed when you’re working with the data it produces, and Brown suggested having in-depth conversations with the people in the know when you’re not on deadline, during downtime, without a specific agenda. “Let’s just have a conversation. I don’t have notebook. It’s not for publication.”
The New York Times’s Jacob Harris pointed out that we often go to nonprofits or advocacy groups for a counterpoint to government information, but it’s important to remember that no one works with data just for kicks; you always have to look for an agenda.
Newsroom Expectations and Saying No
The group also discussed the question of when to pull the plug on a project if the data isn’t working out, especially in situations where you’ve waited months, or worked the story for months. “What do you do to say this data is crap, we’re not going to do stuff?” (Outside, perhaps, of making it a story about how the entity is doing data badly.)
Keller recounted a situation when he’d wanted to pull the plug, but he was outranked by people expecting a chart to go online.
Nikhil Sonnad from Quartz talked about the temptation of data, how you might think “I’m just going to play with this and prototype and not have a hypothesis.” But he pointed out that if there’s no newsy angle, that’s a good reason to pass it up. Before you start playing with the data, brainstorm what would be the angle, he said.
Al Jazeera America’s Latoya Peterson and others brought up the difficulties of doing reliable data work on quick turnaround, and Emma Carew Grovum from Foreign Policy said she keeps a list: this is what we can do in an hour, a day, a week. And, “this is what we need from you: if you do not give me a .csv and 10 minutes of your time to explain it, we’re not doing it.” She’s in a small newsroom and has good support for this approach. WNYC’s Noah Veltman described a similar perspective. He said there’s a “pre-deadline” when his team decides whether they will have the material and the idea by the time they need them in order to hit the deadline. That makes it clear, in terms of expectations, that “there’s a tradeoff between if the team wants to do something different with more time, or something we’ve done before and quick.” (Full disclosure: I work with Veltman at WNYC. -KS)
Larson spoke to how ProPublica works on data where there’s the possibility of a scoop. “When we go fast usually what we do is find the one fact we can work with. You have to know your data sources. Like you can run with census as opposed to if it’s leaked emails or a spreadsheet you found on web.” To check the data, “sort min/max, make sure there are no outliers, make sure the mean and the median are pretty close, that it follows the normal distribution. Those are first four things. And then call a couple folks on phone. I think you can get that done pretty quickly.”
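As a rough illustration of the first-pass checks Larson lists, here is a minimal Python sketch for a single numeric column. It is not ProPublica’s code. The function name quick_checks, the 1.5 × IQR fences used to flag outliers, and the mean-to-median gap measured in standard deviations are all assumptions chosen for the example, and the gap is only a crude hint of skew, not a real test of whether the data follows a normal distribution.

```python
import statistics


def quick_checks(values, iqr_multiplier=1.5):
    """First-pass sanity checks for one numeric column (hypothetical helper).

    Needs at least two values. The IQR fence rule and the default
    multiplier are illustrative assumptions, not a newsroom standard.
    """
    data = sorted(values)  # sorting puts the min and max at the ends

    # Quartiles, then the conventional 1.5 * IQR fences for flagging outliers.
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence = q1 - iqr_multiplier * iqr
    high_fence = q3 + iqr_multiplier * iqr

    mean = statistics.fmean(data)
    median = statistics.median(data)
    stdev = statistics.stdev(data)

    return {
        "min": data[0],
        "max": data[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
        "mean": mean,
        "median": median,
        # Gap between mean and median, in standard deviations.
        # A large gap suggests skew or a few extreme values.
        "mean_median_gap": abs(mean - median) / stdev if stdev else 0.0,
    }


# Made-up column with one suspicious entry.
report = quick_checks([10, 12, 11, 13, 12, 11, 10, 12, 500])
for name, value in report.items():
    print(f"{name}: {value}")
```

A flagged value is a reason to pick up the phone, not proof of an error: the 500 in the made-up column could be a typo or the story.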
Skelly asked how we can keep sharing experiences. There are mistakes, she pointed out, that would have been a 10-minute conversation if people knew each other and were able to ask each other to solve a particular problem. She advocated that we publish mistakes. Let people know you’re willing to be contacted. She gave a shout-out to NICAR-L, and suggested using it to find out who can answer your question, then contacting them directly.
For Moser and Keller’s outline and other notes, see the public Etherpad from the session.
Independent education activist, Source Learning editor, ITP adjunct. Author of Don’t Go Back to School and Follow Me Down. Also obsessed with: relational tech, streets+strangers, cities, brains, stories.