Do You Want This Life?
A data journalism intern’s familiar tale of public-data woes
I spent the summer of 2017 as a Dow Jones News Fund intern at Reveal from The Center for Investigative Reporting. Before this internship, I had been studying broadcast journalism in the Broadcast and Electronic Communication Arts program at San Francisco State University, so my journalism experience really came from interning at Reveal and producing audio pieces in the San Francisco Bay Area. After spending a week at the University of Missouri for a data journalism bootcamp through the Dow Jones News Fund and IRE, I felt inspired to write a story with impact, a story like the Mercury News’ investigation showing that 80 percent of Oakland Fire Department warnings went unchecked.
Months before that data journalism workshop, David Herzog, the Dow Jones News Fund residency coordinator, had suggested that I start looking into my city’s and state’s open data portals. This was my first time using public data, and I remember thinking it was going to be so easy. The data was right there, in a good format, in public, just waiting to be analyzed. And at the data bootcamp, I had picked up some data-cleaning skills, so I figured fixing any errors would be no sweat.
I didn’t take into account the amount of work that a data journalist goes through to make a project real. Every time I thought I found a lead in the data, every time I felt confident, and every time I felt a glimpse of hope in the story, it turned out to be nothing.
But that’s the reality of data journalism, isn’t it?
If It’s Too Good to Be True…
I spent three months looking at data from San Francisco’s open data portal. I looked into four San Francisco Fire Department datasets—on incidents, inspections, violations, and complaints—over and over again.
One of Reveal’s data reporters, Sinduja Rangarajan, sat down with me and explained that I had to vet each column of each dataset to really understand what I was working with. I had to validate the data, checking for errors like typos, spelling variations, and extra spaces, and for anything that seemed out of the ordinary (which can be good or bad).
I selected the entire column in Excel, counted the number of times each entry appeared, sorted it (in both ascending and descending order), and looked at it in alphabetical order for any written mistakes in the data. I looked at the count of a column per year. I checked to see if there was a trend.
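For readers who would rather script this check than do it by hand in Excel, the column vetting described above can be sketched in a few lines of Python. This is only an illustration: the function name `vet_column` and the sample values are made up, and it catches only entries that differ by capitalization or spacing, not every kind of typo.

```python
from collections import Counter

def vet_column(values):
    """Count each distinct entry and flag entries that differ only by case or spacing."""
    counts = Counter(values)
    groups = {}
    for value in counts:
        # Collapse runs of whitespace, trim the ends, and lowercase for comparison.
        key = " ".join(value.split()).lower()
        groups.setdefault(key, []).append(value)
    # Any normalized key with more than one raw spelling deserves a closer look.
    suspects = {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}
    return counts, suspects

# Hypothetical sample column, not real SFFD data.
sample = ["Blocked exit", "blocked exit", "Blocked exit ", "Alarm system", "Alarm system"]
counts, suspects = vet_column(sample)

# Alphabetical listing of every raw entry with its count.
for value, n in sorted(counts.items()):
    print(repr(value), n)
print(suspects)
```

Printing each entry with `repr` makes trailing spaces visible, which is easy to miss in a spreadsheet cell.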
I spent hours writing notes on each dataset into my data diary. That helped out when I needed to refer to what I thought were some great, interesting stories in the data that I checked. I may have gone a little overboard and written about 25 pages’ worth of notes. I was excited that I was seeing things in the data.
I saw a huge jump in violations and complaints from 2015 to 2016. So I compiled some aggregate numbers and wrote down some questions to challenge fire officials.
In early July, I sent an email to SFFD’s public information officer with queries showing the number of violations and complaints per year, and I asked them to connect me with staff who knew about the data. While I was on the phone with the fire department with my queries and notes, I pointed out that jump in 2016 for the two datasets.
But the public information officer shot me down. When they checked, the data for violations and complaints I downloaded from DataSF turned out to have been duplicated—it was unusable. The datasets that I pulled from the city’s data portal and SFFD’s internal datasets weren’t matching. The problem was how DataSF was processing the two datasets, which meant that records were being copied and showing more than there actually were. Instead of making me wait a few days for the corrections to appear online, SFFD sent me their internal violations dataset.
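A duplication problem like this one can sometimes be spotted before calling anyone, by counting how often the same record appears. The sketch below is an assumption-laden illustration, not how SFFD or DataSF found the issue: the field names `violation_id` and `date` are hypothetical, and the right key fields depend on what actually identifies a unique record in the dataset.

```python
from collections import Counter

def find_duplicates(rows, key_fields):
    """Return each key that appears more than once, with its number of appearances."""
    keys = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in keys.items() if n > 1}

# Hypothetical rows, not real SFFD data.
rows = [
    {"violation_id": "V-1", "date": "2016-03-01"},
    {"violation_id": "V-1", "date": "2016-03-01"},
    {"violation_id": "V-2", "date": "2016-03-02"},
]

dupes = find_duplicates(rows, ["violation_id", "date"])
# Number of rows that would disappear if each record were kept only once.
extra_rows = sum(n - 1 for n in dupes.values())
print(dupes, extra_rows)
```

A large `extra_rows` count concentrated in one year would be a reason to question a sudden jump before treating it as a finding.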
A month had passed by this point, and my confidence had started to drop. I talked the problem over with a couple of the data reporters who had been helping me along the way, and we changed tack. We decided to look into why 2016 had more violations in the department’s data than did prior years.
And…it turned out to be nothing. Lieutenant Mary Tse of the fire department explained that a change in the notice of violation policy could have contributed to what seemed like a high number of violations. The SFFD argued that the policy now counted what used to be just a “warning” as a violation, although they still gave owners the opportunity to fix the issue instead of paying a fee.
So back to step one.
I downloaded all four datasets again and checked all the numbers for errors and duplication, again. (SFFD’s IT manager told me that the incidents and inspections datasets should be fine, but I double checked them just in case.)
I decided to join complaints to inspections on the inspection number. When I joined the two datasets, I noticed that there were 1,094 records without inspection numbers, but these turned out to be complaints filed within 2017 where the location needed to be inspected. The DataSF complaints dataset was also different compared to the set from SFFD. It looked like a dead end.
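The join described above, including the count of complaints left without a matching inspection, can be sketched as a simple left join. The field names (`inspection_number`, `complaint_id`, `status`) and the sample rows are hypothetical, and the sketch assumes each inspection number appears at most once in the inspections data.

```python
def left_join(complaints, inspections, key="inspection_number"):
    """Attach inspection fields to each complaint; collect complaints with no match."""
    # Index inspections by their number, skipping rows where it is blank.
    by_key = {row[key]: row for row in inspections if row.get(key)}
    matched, unmatched = [], []
    for complaint in complaints:
        inspection = by_key.get(complaint.get(key))
        if inspection is None:
            unmatched.append(complaint)
        else:
            # Prefix inspection fields so they cannot overwrite complaint fields.
            extra = {f"inspection_{k}": v for k, v in inspection.items() if k != key}
            matched.append({**complaint, **extra})
    return matched, unmatched

# Hypothetical rows, not real SFFD data.
complaints = [
    {"complaint_id": "C1", "inspection_number": "100"},
    {"complaint_id": "C2", "inspection_number": ""},
    {"complaint_id": "C3", "inspection_number": "101"},
]
inspections = [
    {"inspection_number": "100", "status": "closed"},
    {"inspection_number": "101", "status": "open"},
]

matched, unmatched = left_join(complaints, inspections)
print(len(matched), len(unmatched))
```

Keeping the unmatched complaints in their own list, rather than dropping them, is what makes it possible to ask why they have no inspection number.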
So I tried another angle. It was July 25th, and I had a month left in my internship. My next step was using employee compensation data to show the number of SFFD staff by position per year—but the data wouldn’t show that. That data shows who got paid and the amount of payment for the year, but not which people worked as staff that year: some were retired.
The compensation data also didn’t show the number of hours spent working per year, so I wasn’t able to say who worked full time or part time. It was no use. I had spent more than a week waiting for the San Francisco controller to send me the spreadsheets of all employee compensation data per year, and then I didn’t even use them.
It was early August now, and I decided to check on the updated incidents dataset. I spent weeks analyzing it and focused my attention on a story about false alarms.
I read reports on false alarms, unwanted alarms, and smoke alarm activations, and I spoke with fire experts on these issues—it made me feel like a minor expert. I felt like I was finally off and running toward a good story. I knew more about how fire departments record incident types using the NFIRS guide than I did about taking local public transportation.
I got so far along that I wrote a draft story based on my conclusion that 40 percent of all the incidents the San Francisco Fire Department responded to were false alarms.
But when I sent my findings to the department, the story fell apart again.
I didn’t realize that the fire incidents dataset from DataSF only included “non-medical incidents,” so it wasn’t even close to the complete picture. The non-medical incidents data comprised only 20 percent of all incidents.
When I went back and looked at the fine print, sure enough, the open data portal described the Fire Department incidents as “non-medical incidents responded by the SF Fire Department.” But even then, it didn’t say on the data portal that this dataset is only about 20 percent of the department’s workload. The public information officer told me the rest of the data for medical incidents is not available because it is protected under the Health Insurance Portability and Accountability Act.
It felt horrible being told that after all that work my story was wrong. Instead of 40 percent of all the incidents the department responded to being false alarms, it was 10 percent. A bit above the national average of 6 percent, but it didn’t constitute a crisis.
I sat at my desk and looked at the fire department’s responses to the questions I had sent weeks earlier, and thought, “I don’t know what to do now.” I carried my laptop to my editor and said the fire department had replied, and didn’t say anything else as he read their replies on my screen. Afterward, he told me of an experience he had where an editor joked with him to not “over-report” a story—don’t poke that story too hard with a stick, or it might fall apart. A lot of times things are too good to be true.
Do You Want This Life?
By now it was the end of August, and my internship was supposed to be over. I had spent all summer looking into data that I didn’t even use. I didn’t know what else I could do. Each time I knocked on my editor’s door I felt a little more defeated, and that final response from the fire department kicked me in the gut.
I felt like I lost a battle that may have only played out in my head—like a competitive runner using all of their strength to cross the finish line…against a random kid on a bike.
When you go through all this trouble working with public datasets, you really have to ask yourself, “Do you want this life?” Here’s my answer:
Maybe. I’ve learned so much this summer about data journalism, but I don’t see myself exclusively being a data journalist. I’ve had more experience producing audio pieces, and I see myself continuing my experience with producing and reporting—but that doesn’t mean these skills that I learned will be lost. I saw how the data team from Reveal worked through all of their projects, and they were so helpful and understanding of all of the issues I faced. This is totally normal for them, and totally new for me. I have so much respect for data journalists, but I don’t see myself dedicating myself completely to data journalism. That said, I’m preparing for another internship that will focus on data reporting, and I’ll try my best to improve on the skills I’ve learned so far.
David Rodriguez was a 2017 Dow Jones News Fund data intern at Reveal from the Center for Investigative Reporting. He’s written for San Francisco’s bilingual newspaper, El Tecolote, and is a San Francisco State University alumnus.