COVID-19 story recipe: Analyzing nursing home data for infection-control problems
Where to find the data, how to explore it, and questions to ask to reproduce the story for your community
It started, as so much wild stuff does these days, with a tweet. Health reporter Ben Conarck at The Miami Herald found that two-thirds of Florida nursing homes had been cited for infection-control problems in the last three years. With coronavirus looming—in fact, already killing people at a nursing home in Kirkland, Wash.—this seemed like an important story with data we’d be referring to in the future.
The story we found
Facilities that struggle to control infections on regular days could have a disastrous time with coronavirus, which spreads quickly and has terrible effects on the elderly and people with health complications. That describes nursing home residents well. And before long we’d found that three-quarters of nursing homes had been cited for infection-control problems.
Finding those numbers and details on local nursing homes turned out to be surprisingly easy.
I used a bunch of custom Python scripting for reproducibility, but was able to redo most of the work in less than 20 minutes in Excel. USA TODAY reporter Marisa Kwiatkowski stepped in to make piles of phone calls to seek important perspective and find a painfully elusive human element to the story.
The resulting piece wound up being my first byline with the new company, and the first time I’d worked with Marisa. The data? I’m certain I’ll be referring to it frequently.
How you can analyze the data
It’s easy to get started. There’s a link to download the database from the main Nursing Home Compare search page, which you’ll be using a lot anyway. The data’s small enough you can open it in Excel, with some grumbling from your computer. (Again, Python for reproducibility could be your friend.)
- Look for the “Download CSV Flat Files (Revised) Now” link, download the ZIP, and unzip it.
- You’re going to start in the HealthDeficiencies_Download.csv file. Open it in Excel, and filter on the TAG column where TAG is 441, 880, or 882.
NOTE: The data covers three years of inspections. More than two years ago, Medicare changed systems; the old 0441 code became 0880, and a related code, 0882, was also implemented. If you took a shortcut on importing the data, Excel will bring these codes in just as 441 and 880. My March 3 file didn’t have any 0882 codes.
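If you’d rather script this step, here’s a minimal pandas sketch of the filter. A tiny stand-in table keeps it self-contained; for the real thing, swap in pd.read_csv("HealthDeficiencies_Download.csv", dtype=str). The column names (TAG, PROVNUM) follow this article—check the header row of your own download.

```python
import pandas as pd

# Stand-in for the real file; replace with:
# deficiencies = pd.read_csv("HealthDeficiencies_Download.csv", dtype=str)
# Reading everything as strings preserves any leading zeros (0441 vs. 441).
deficiencies = pd.DataFrame({
    "PROVNUM": ["015009", "015009", "015010", "015011"],
    "TAG": ["0880", "0550", "441", "0882"],
})

INFECTION_TAGS = {"441", "880", "882"}  # old 0441, new 0880, related 0882
# Strip leading zeros so 0441/0880/0882 and 441/880/882 both match.
infection = deficiencies[deficiencies["TAG"].str.lstrip("0").isin(INFECTION_TAGS)]
print(len(infection))  # write this count down; it's your check number
```
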
- In the bottom-left corner, Excel should tell you it’s filtered some rows—my March 3 dataset had 20,449 rows with those codes. Write this number down. You’re going to keep checking this.
- Open the ProviderInfo_Download.csv file and save it as a new Excel workbook, being careful not to overwrite your original CSV. Flip back to the HealthDeficiencies_Download file; copy your filtered data, then paste it into your new workbook as a new sheet, then close your original CSVs. (Verify you still have a filtered copy with the right number of rows!)
- In Excel, you can do a pivot table; each nursing home has a unique provider number, called PROVNUM. Set PROVNUM to be the row, and count of TAG to be your values. You now know which nursing homes have how many infection-control citations from their inspections. (Name this pivot table tab something memorable!)
Double-check: You should have about 11,537 unique provider numbers. If you scroll to the bottom of the pivot table, you’ll see a grand total. It should match the number of filtered rows you wrote down earlier, which in my case was 20,449.
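In Python, that pivot-table step is a one-line groupby. Again, a toy frame stands in for your real filtered data, and the variable names are mine:

```python
import pandas as pd

# Stand-in for your filtered infection-control citations.
infection = pd.DataFrame({
    "PROVNUM": ["015009", "015009", "015010"],
    "TAG": ["0880", "0882", "441"],
})

# Citations per nursing home: the equivalent of the PROVNUM pivot table.
counts = infection.groupby("PROVNUM").size().rename("VIOLATIONS")
print(counts)

# Double-check: the per-home counts should sum back to the filtered row total.
assert counts.sum() == len(infection)
```
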
- Now flip over to the original ProviderInfo tab. This has your address information, patient counts and some other stuff that might be helpful.
- Go to the first empty column. Label it something creative, like VIOLATIONS. You’re going to do a lookup on the provider number, and the first provider number appears in A2. So for row 2, you’re looking up that provider number against the provider numbers in your pivot table; when VLOOKUP finds a match, it returns the value from the second column of the range. So point it at columns A and B of the pivot sheet, and give it more than enough rows.
- In Excel:
=vlookup(a2, pivot!$a$1:$b$20000, 2, false). That should get you the resulting count of violations for each nursing home where there’s a match. But some homes had no citations, so we can wrap it in another function:
=iferror(vlookup(a2, pivot!$a$1:$b$20000, 2, false), 0). And now nursing homes that have no matching infection-control citations show up as having none.
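The VLOOKUP-plus-IFERROR pattern is a left join followed by fillna(0) in pandas. Toy frames stand in for the two sheets here; the home names are made up:

```python
import pandas as pd

# Stand-ins for the ProviderInfo sheet and the pivot-table counts.
providers = pd.DataFrame({
    "PROVNUM": ["015009", "015010", "015011"],
    "PROVNAME": ["Home A", "Home B", "Home C"],
})
counts = pd.DataFrame({"PROVNUM": ["015009", "015010"], "VIOLATIONS": [2, 1]})

# Left join keeps every nursing home; fillna(0) plays the role of IFERROR,
# so homes with no matching citations show up as zero.
merged = providers.merge(counts, on="PROVNUM", how="left")
merged["VIOLATIONS"] = merged["VIOLATIONS"].fillna(0).astype(int)
print(merged)
```
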
- Copy that formula down. Now you can do a pivot table on your ProviderInfo sheet. Make VIOLATIONS your row, so you’re seeing how many nursing homes had violations. Drag VIOLATIONS to values twice. For one of these, you want the count of violations: how many nursing homes had three citations, or four, or whatever. Right-click on the first “SUM OF VIOLATIONS” and change the field value from SUM to COUNT.
- The “grand total” of the sum of VIOLATIONS should match what you filtered originally—for me, it was 20,449. In other words, I’ve accounted for every citation; I don’t have records of violations I couldn’t match to nursing homes.
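And that final pivot is a value_counts in pandas. This sketch also computes the share of homes with at least one citation, which is where the story’s core number came from; the toy figures below stand in for the real merged sheet:

```python
import pandas as pd

# Stand-in for the merged ProviderInfo sheet with its VIOLATIONS column.
merged = pd.DataFrame({"VIOLATIONS": [0, 0, 1, 2, 2, 3]})

# How many homes had 0, 1, 2... citations.
distribution = merged["VIOLATIONS"].value_counts().sort_index()
total_citations = merged["VIOLATIONS"].sum()     # should match your filtered row count
share_cited = (merged["VIOLATIONS"] > 0).mean()  # fraction cited at least once
print(distribution)
print(round(share_cited, 2))
```
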
Where to look for your story
The count of violations is interesting here: I saw that 3,913 nursing homes (out of 15,450) had no citations. So 25 percent of nursing homes had no infection-control problems, but 75 percent did. That turned out to be the core of the story. It was a simple, accurate, verified, easy-to-grasp number. It’s also terrifying.
Other resources helped round out our reporting. A source was able to verify we were looking for the correct codes, and a PowerPoint gave us more reassurance. A Medicaid manual has a guide to interpreting the scope of the problems, which is two-dimensional: Did this affect just a patient or a room, or was it widespread? Did the problem have little potential to cause harm, or was it an immediate threat? The rubric will tell you.
This dataset is rich for reporting, and we certainly aren’t the only newsroom using it. Kaiser Health News published a similar analysis this month, focused on more recent inspections, and AP is making filtered versions of the data available to members.
Looking ahead at coronavirus coverage, analyzing this data gives you an easy way to look at what’s going on in individual nursing homes, or across your state or region. A reporter at the Las Vegas Review-Journal used the same approach we walked through in this piece and found that 80 percent of the state’s nursing homes have failed to meet federal standards for infection control in recent years—important context in a story about new guidelines to protect nursing home residents. You might do a story like that. You also could:
- Compare places with any sort of infection-control citation to others in your area. Perhaps you’ll see trends based on facility size or other factors. You might find trends in whether affected nursing homes are run as for-profit, non-profit, or by a government agency.
- Look up specific violations through the Nursing Home Compare site and find typical examples and extreme examples.
- There are some great summary statistics, too. How many people are living in nursing homes with infection-control problems? How many with the most serious problems? How do they compare to the nursing home affected by coronavirus in Kirkland, Washington?
- Follow up with interviews! Talk to people who can help you put the numbers for your area in context. The CDC has a list of infection-prevention questions you can ask to get started.
Living in Florida, where power outages have resulted in nursing-home patients’ deaths, I’m now familiar with something that could be critically important during hurricanes. But first we need to cover coronavirus, and our first story tried to get some aggregate numbers into the hands of local reporters. I used Python to generate this text for every county here with nursing homes. Where I live, the numbers look quite bad for thousands of my neighbors.
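A hedged sketch of that county-summary idea, in case you want to try it: the COUNTY_NAME column and the sentence template here are my assumptions, not the story’s actual script, so check your provider file for the real county column.

```python
import pandas as pd

# Hypothetical stand-in for the merged provider sheet; COUNTY_NAME is assumed.
providers = pd.DataFrame({
    "COUNTY_NAME": ["Palm Beach", "Palm Beach", "Broward"],
    "VIOLATIONS": [2, 0, 1],
})

# One boilerplate sentence per county, ready to hand to local reporters.
summaries = []
for county, group in providers.groupby("COUNTY_NAME"):
    cited = int((group["VIOLATIONS"] > 0).sum())
    summaries.append(
        f"In {county} County, {cited} of {len(group)} nursing homes "
        f"were cited for infection-control problems."
    )
print("\n".join(summaries))
```
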
Mike Stucka became The Palm Beach Post’s self-titled data dork in early 2016, after finding ways to bring data into regular beat work for The Telegraph of Macon, Ga., The Salem (Mass.) News, and other publications all the way back to his undergraduate newspaper.