Grabbing Government Data Before It’s Destroyed

Reporting back from a recent emergency data-archiving event in NYC

(Dan Phiffer)

Last Saturday morning, over 200 scientists, programmers, librarians, artists, students, and academics gathered for Data Rescue NYC to help archive at-risk scientific datasets. The event was the latest in a multi-city series organized by the Environmental Data and Governance Initiative (EDGI), an international collaboration run by non-profits and academics working to support environmental government agencies.

Saturday’s iteration, produced by Liz Barry from Public Lab and co-hosted by NYU faculty from the Department of Anthropology and Gallatin School of Individualized Study, was mainly focused on the data collection aspect of the EDGI project, which is comprehensive in its approach. In addition to downloading scientific data from public websites, the project monitors agency websites to highlight changes in how they publicly present their work. EDGI also aims to preserve institutional knowledge by interviewing retiring public sector workers.

Where to Begin

Climate-change related deletions on the EPA’s site (EDGI)

Shortly after the scheduled 9am start time, NYU visiting scholar Jerome Whitington gave an impassioned summary of the present environmental-political context. He illustrated how an Environmental Protection Agency web page on Sustainable Water Infrastructure had been edited to comply with the new administration’s political agenda:

Water and wastewater utilities are typically the largest consumers of energy in municipalities, often accounting for 30 to 40 percent of total energy consumed. Implementing energy efficiency measures at water sector systems can significantly reduce operating costs and mitigate the effects of climate change.

Introductions completed, the participants split into three tracks based on their level of technical experience. The largest, most tech-savvy group got a brief overview of a new custom workflow tool built to replace a collection of Google Spreadsheets from earlier data rescue events. The two other groups split off to flag new agency website URLs for future monitoring, and to discuss storytelling strategies to make the archival process more legible to the public.

This iteration of the event series focused on the harder-to-reach recesses of agency websites: data hidden behind unexpected user interfaces that automated processes can't easily find. In the main room, the more technically experienced volunteers were encouraged to work collaboratively, to “level up” their neighbors.

The archival process began with each volunteer researching the context and circumstances of how they found their dataset. Once a given URL had been thoroughly researched and deemed worthy of collection, a volunteer would “claim” the URL so that others wouldn’t duplicate the same efforts.

Workflow & Tools

Two women at the Data Rescue NYC event (Data Rescue NYC)

The workflow process is being collaboratively developed by EDGI in a GitHub document. Harvesting begins by downloading a starter zip file meant to standardize the collected data into a predictable format. Unzipping the starter reveals identifying metadata files that keep the bundle trackable throughout the workflow process, like a pet’s subdermal RFID chip. As scaffolding, the starter also contains “data” and “tools” sub-folders, for the harvested dataset itself and for any scripts used during the collection process, respectively.

Once a dataset is collected, the starter folder is re-zipped and uploaded back into the workflow tool for further processing. The archive is then funneled through a review process, with each stage approved by a smaller pool of increasingly trusted volunteers.
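The repacking step described above can be sketched in a few lines of Python. The “data” and “tools” sub-folder names come from the workflow description; everything else here (function name, file layout) is a hypothetical illustration, not EDGI’s actual tooling:

```python
import zipfile
from pathlib import Path

def repack_bundle(bundle_dir: str, out_zip: str) -> None:
    """Re-zip an unpacked starter bundle, preserving its layout.

    The bundle is expected to contain the identifying metadata files
    plus 'data/' (the harvested dataset) and 'tools/' (any scripts
    used during collection).
    """
    root = Path(bundle_dir)
    for required in ("data", "tools"):
        if not (root / required).is_dir():
            raise ValueError(f"missing expected sub-folder: {required}/")
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the bundle root so the
                # archive unpacks to the same predictable layout.
                zf.write(path, path.relative_to(root))
```

Keeping paths relative to the bundle root is what makes the re-zipped archive land back in the workflow tool in the same predictable shape it started in.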

The web-based workflow tool was created by Brendan O’Brien, who had attended previous Data Rescue events and wanted to improve the coordination of volunteers’ downloading efforts. The tool uses the Meteor framework, built on a Node.js JavaScript back-end, chosen to accommodate the time-sensitivity of the URL “check out” and “check in” system. O’Brien had just come from a smaller event in Boston where he’d put the system through its initial testing. At some point just before lunch, he announced, “You can now add new URLs to the workflow!” to a round of applause.
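The “check out” / “check in” mechanic is essentially a short-lived lease on a URL. A minimal in-memory sketch of the idea (the real tool is a Meteor web app; this class, its names, and the one-hour expiry are all hypothetical):

```python
import time

class UrlClaims:
    """Toy model of claiming URLs so volunteers don't duplicate work.

    A claim expires after `ttl` seconds, so a URL abandoned by one
    volunteer becomes available to others again.
    """

    def __init__(self, ttl: float = 3600.0):
        self.ttl = ttl
        self._claims = {}  # url -> (volunteer, claimed_at)

    def claim(self, url: str, volunteer: str, now: float = None) -> bool:
        """Try to check out a URL; return False if it's already held."""
        now = time.time() if now is None else now
        holder = self._claims.get(url)
        if holder and now - holder[1] < self.ttl:
            return False  # someone else is already working on it
        self._claims[url] = (volunteer, now)
        return True

    def release(self, url: str) -> None:
        """Check the URL back in once the dataset is uploaded."""
        self._claims.pop(url, None)
```

The expiry is the “time-sensitive” part: without it, a volunteer who claims a URL and then leaves the event would lock that dataset out forever.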

I chatted with some participants who’d found target datasets and were engaged in harvesting data. Nick Gregory, an undergraduate Computer Science student from NYU, had seen a link to the event on Reddit and decided to get involved. He was waiting on 4GB of PowerPoint climate analysis to download from the NOAA website.

At a nearby table, Brandon Liu was downloading around 100GB of geodata for the entire coastline of the United States. He had spun up a virtual machine on Amazon Web Services and was keeping an eye on the download speeds. He seemed concerned that his speedy 10 MB/s connection had slowed down to 100 KB/s. “I already installed a bunch of libraries, so I added another account for a friend to use the same machine.”

At some point he closed his laptop, “it’s all happening in the cloud.” Indeed, the workflow tool itself ultimately uploads the zipped archives onto Amazon’s Simple Storage Service (S3), a popular cloud-based storage service.

Later I spoke to Jen Green, Director of Research Data Services at the University of Michigan, who was familiar with the challenges of safeguarding scientific datasets. She had hosted a previous event in Ann Arbor with other librarians, and had flown to New York to help. She explained the stakes in research university terms: “Data that’s being produced now is in jeopardy, but it’s also the historical data those researchers rely on for their research.”

Archival Integrity & Futures

After a lunch break, participants split off into classrooms to discuss future steps for the data once it had been harvested. There was a discussion about data provenance and chain of custody: how could we tell whether the data had been copied perfectly, in its entirety? Someone mentioned the IPFS project and explained how that distributed system uses cryptographic hashes to verify integrity and avoid duplication.
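The core idea behind content-addressed systems like IPFS can be shown with an ordinary cryptographic hash: byte-for-byte identical copies produce the same digest, so a mismatch reveals a corrupted or incomplete copy, and matching digests let you deduplicate. (A sketch only; IPFS itself chunks files and uses multihash-encoded digests, not this exact scheme.)

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that uniquely identifies the content."""
    return hashlib.sha256(data).hexdigest()

original = b"contents of a harvested dataset"
good_copy = bytes(original)
truncated = original[:-1]  # an incomplete download

# Identical content hashes identically, so duplicates are detectable...
assert fingerprint(original) == fingerprint(good_copy)
# ...and any corruption or truncation changes the digest entirely.
assert fingerprint(original) != fingerprint(truncated)
```

Publishing the digest alongside the archive gives anyone downstream a way to prove their copy matches what was originally harvested, which speaks directly to the chain-of-custody question.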

Someone else asked whether a scientist could legitimately publish research citing data hosted outside of its original government context. Organizer Lou Huang hoped we wouldn’t need to find out: “I really wish that everything we are doing here ends up being a huge waste of time.” This could all very well be a false alarm, but as software developer Matt Blaze put it, “this is what we should have been doing all along.”

Ultimately the data will be archived using CKAN, a standard data distribution system already in wide use by libraries and archivists.

That same Saturday, the USDA removed an online animal abuse database. On Sunday, a bill was introduced into Congress to eliminate the EPA altogether. And a crack in an Antarctic ice shelf grew 17 miles in the last two months.

The data-destroying precedent set by Canada’s Harper administration has mobilized a growing and increasingly organized effort to avoid the same fate in the United States. Archiving public data for continued scientific research (and for use in the private sector) has become a front-line battle for a scientific community whose members are increasingly entering politics and gearing up for a planned March on Washington.

The next Data Rescue events are scheduled this coming Saturday February 18 in Boston, Boulder, DC, and Philadelphia.


