Creating an API of Veterans Affairs’ data
Shane Shifflett and Cole Goins talk data and engagement
Last summer, the Center for Investigative Reporting published a story and map about a backlog of disability claims filed with the US Department of Veterans Affairs. That map is now powered by an API serving the data CIR has gathered over months of reporting on how long the Department of Veterans Affairs takes to respond to disability claims. Cole Goins and Shane Shifflett told us a little about the story behind the development and release of the API.
Creating the API
Q: You’ve been analyzing and visualizing data from this story for months. When did you decide you wanted to develop an API for the data?
The motivation behind the API came from two places. When CIR started to report on the backlog, I (Shane) built a map from the data we obtained through the Department of Veterans Affairs to highlight the agency’s disability claims backlog nationwide. As we uncovered more data I realized the map wasn’t big enough to let the data breathe. The dashboard we launched last week seemed like a reasonable alternative for breaking down the data in ways that could be easily analyzed and localized. Also, the data sources Aaron Glantz, CIR’s reporter on veterans issues, was able to access weren’t always public. It took CIR months of work to gain access to some of these documents, and some expertise to decode them. We figured we could save other people the work by opening up that data through an API.
Q: What was the process like in developing the API? Were there any particularly tricky issues with the data, either technical or in terms of journalistic decisions about how and what data to share?
The API itself was straightforward. Our map and API share the same database, and because much of the backend for the map was written with Python/Django, building the API was as simple as adding Tastypie to the project and following the documentation.
The harder part was managing all the data. We started collecting data from publicly accessible sources for reports and our interactive map. We grew our database through PRA requests and concerned parties leaking data to us. It seemed like each new document we received was formatted differently from ones we encountered before. Each one would print data at different time intervals or use different identifiers to relate data points to a location. Worse yet, the data was referred to by internal VA codes that weren’t plainly documented. So each new document required a little reporting to figure out what it actually contained. Glantz would confirm the data with VA officials.
Then the data gymnastics could start, which generally means writing a new parser, normalizing the data so it maps to the database we created for the map, and checking for errors. As I built the API I realized some of the distinctions I made about the data were trivial. They were differences important for reporting but not for storing data, and that complicated the API. For instance, we had two models for time series data that would have required the user to grab data from two separate endpoints. Once I started thinking about how someone other than myself would access this data, a lot of those issues became clear and had to be revised.
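To make the parse-and-normalize step concrete, here is a minimal Python sketch, not CIR's actual parser. It assumes two hypothetical source documents, one weekly and one monthly, that label the same office with different identifiers. Every column name, office code, and number below is invented for illustration. Both documents are mapped onto a single record shape, which is the kind of consolidation that would let one endpoint serve all the time series data.

```python
import csv
import io
from datetime import datetime

# Hypothetical alias table: each source labels the same office differently.
OFFICE_ALIASES = {"EX1": "Example Office", "999": "Example Office"}


def normalize_row(row, office_field, date_field, date_format, value_field):
    """Map one source-specific row onto a common record shape."""
    raw_office = row[office_field].strip()
    # Fall back to the raw identifier when no alias is known.
    office = OFFICE_ALIASES.get(raw_office, raw_office)
    day = datetime.strptime(row[date_field].strip(), date_format).date()
    # Strip thousands separators before converting to an integer.
    value = int(row[value_field].replace(",", ""))
    return {"office": office, "date": day, "pending_claims": value}


def parse_document(text, **field_spec):
    """Parse one CSV-like document using its own field names and date format."""
    reader = csv.DictReader(io.StringIO(text))
    return [normalize_row(row, **field_spec) for row in reader]


# Two invented documents with different layouts and reporting intervals.
WEEKLY = 'station,week_ending,pending\nEX1,07/06/2013,"1,200"\n'
MONTHLY = "office_code,month,claims\n999,2013-06,1100\n"

records = parse_document(
    WEEKLY,
    office_field="station",
    date_field="week_ending",
    date_format="%m/%d/%Y",
    value_field="pending",
) + parse_document(
    MONTHLY,
    office_field="office_code",
    date_field="month",
    date_format="%Y-%m",
    value_field="claims",
)

# One sorted series per office, regardless of which document a row came from.
records.sort(key=lambda r: (r["office"], r["date"]))
```

Each new document format then only needs its own field specification rather than a new storage model.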
Q: The API also drives the new VA Data Dashboard on your site. What did you use to build that dashboard?
I built the dashboard using RaphaelJS and D3. I drew the graphs with Raphael because it’s a little better with older IE browsers. D3 provides the scales to ensure the data fits the bounds of its container and is proportional. Backbone collects all the data from CSV files hosted on Amazon S3 (most browsers) or from the servers (if you’re using Internet Explorer).
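For readers unfamiliar with scales, the idea is a function that maps a data value proportionally onto a pixel range. The sketch below shows that idea in Python; it is not D3's API or the dashboard's code, and the function name, chart height, and sample values are all assumptions made for illustration.

```python
def linear_scale(domain, output_range):
    """Return a function mapping values in `domain` proportionally onto `output_range`."""
    d0, d1 = domain
    r0, r1 = output_range
    if d0 == d1:
        raise ValueError("domain must span more than one value")

    def scale(value):
        # Fraction of the way through the domain, then the same fraction of the range.
        t = (value - d0) / (d1 - d0)
        return r0 + t * (r1 - r0)

    return scale


# Made-up values and a hypothetical 200-pixel-tall chart area.
values = [0, 250, 1000]
chart_height = 200

# The range is inverted because y coordinates in browser graphics grow downward.
y = linear_scale((0, max(values)), (chart_height, 0))
pixels = [y(v) for v in values]
```

Because the largest value in the data defines the top of the domain, every point lands inside the container no matter how the numbers change.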
Impact and Next Steps
Q: In the ongoing discussion about impact in journalism, it’s awesome to see a page devoted to the impact of this story. How has that impact influenced how you’ve approached continuing to build out and develop the project?
Yeah, our News Engagement Specialist Kelly Chen has been using Rebel Mouse to help track and showcase how our work has been cited, from media outlets across the country to veterans’ rights groups to lawmakers on both sides of the aisle. There’s been a lot of movement around the issue in recent months, so we’ve been working hard to capture the ways that our reporting has had an effect. We really started to gain momentum after Aaron’s great story from March, and wanted to capitalize on that by offering up our data for anyone to use, encouraging widespread distribution and localization. Now that our stories have helped inform the national dialogue around the issue, we also want to use this initiative as a means to help highlight potential approaches to solving some of the problems at the heart of the backlog. Being a catalyst for change is at the core of CIR’s mission, and we want to build our investigations in a way that helps facilitate that impact.
Q: Any lessons you want to share from the development and community outreach of this project?
From a community outreach standpoint, we’ve really done a lot of groundwork to engage veterans everywhere, tell them about our work and encourage them to share their experiences with us. Having a defined audience goes a long way toward crafting effective outreach efforts, and we always try to figure out ways to take our stories to where those audiences already are, creating more reciprocal relationships and inviting them into our reporting. By including a link to a Public Insight Network form we built from our map and sharing it through partners and social media, we’ve heard from dozens of veterans who have experience filing a disability claim with the VA, adding depth and context to our reporting. We’ve also tried new outreach methods such as creating physical postcards with information about the backlog and our contact information, passing them out at veteran-related events and distributing them through veterans’ groups. It’s been tough to track the direct impact of those, but they’ve definitely helped us get the word out about our reporting.
Engaging other journalists around this project is also teaching us a lot about how to make our data and reporting more useful and localizable for media partners of all stripes. We’re taking notes as folks continue to give feedback on the API and data dashboard that will help us build more efficient collaborative projects in the future. One thing that has also helped, since we’re a smaller and lesser known nonprofit news organization, is having our branding travel with the API and asking our partners to credit us in their work. We’ve received a number of calls from veterans who saw our work elsewhere, so it’s definitely helped us get the word out through other news organizations that have larger audiences than we do.
Q: What’s next for the project?
We’re ultimately aiming to highlight at least one veteran’s story for each of the 58 regional VA offices on our map, so that’s what we’re currently working toward. There are faces behind this data, and we really want to emphasize that through this initiative. If you’re a journalist interested in localizing the story for your area, get in touch! We’d love to feature a link to your work in our map. Our reporter Aaron Glantz is also going to keep reporting on the backlog, along with other veterans-related stories, so stay tuned for more in the coming weeks.
Releasing the API
Q: Do you offer any support in using the API? Have you found that the news organizations that used the data you previously reported are equipped to access the API as well?
Few of the organizations we’ve worked with have the capacity to consume data directly from the API. The data is for anyone to use and I think we’ve made it available to people of all skill levels by letting users download the CSV files we generated and offering easily embeddable charts. If someone has the capability to integrate data from CIR’s API into a larger system, that’s also great. As far as support, the dashboard itself is open source and can act as an example of how the data can be used. We’re happy to respond to any issues or questions on the dashboard’s GitHub page.
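For an organization that does want to pull from an API like this one, the sketch below shows one way to page through list results in Python. It assumes the API returns Tastypie's default list envelope, a `meta` object whose `next` field points to the following page and an `objects` array of records; whether CIR's API keeps that default is an assumption. The URL and the sample pages are hypothetical and stand in for real responses.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_json(url):
    """Fetch one page over HTTP and decode it as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def iter_objects(start_url, fetch=fetch_json):
    """Yield every record, following `meta.next` until it is empty."""
    url = start_url
    while url:
        page = fetch(url)
        yield from page["objects"]
        next_path = page["meta"].get("next")
        # `next` is assumed to be a path relative to the host, so resolve it.
        url = urljoin(url, next_path) if next_path else None


# Hypothetical pages standing in for real responses, so the sketch runs offline.
FAKE_PAGES = {
    "https://api.example.org/v1/claims/?limit=2": {
        "meta": {"next": "/v1/claims/?limit=2&offset=2"},
        "objects": [{"id": 1}, {"id": 2}],
    },
    "https://api.example.org/v1/claims/?limit=2&offset=2": {
        "meta": {"next": None},
        "objects": [{"id": 3}],
    },
}

rows = list(
    iter_objects(
        "https://api.example.org/v1/claims/?limit=2",
        fetch=FAKE_PAGES.__getitem__,
    )
)
```

Passing the fetch function in as an argument keeps the paging logic testable without a network connection; leaving it at its default performs real HTTP requests.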
Q: Sharing is clearly baked into all the work CIR does, but it’s still pretty remarkable to see “Tweet, embed this, download data” directly underneath the title on the charts in your data dashboard. How’d you decide to make those components so central?
Given the size and scope of the story, we figured the data should be accessible to anyone. What better way to do that than give users raw access to the spreadsheets or let them pack up the graphs for their own site? We wanted to make the data as relevant and useful as possible to readers and journalists by breaking the numbers down in a way that could travel easily. Effective distribution is always a priority for CIR’s products, and we made sure that the dashboard had those features front and center to maximize shareability.