Meet Disputed Territories and SSN Redactor
The first two projects out of the #owhack Hack Day
Over the weekend, we put on the third Knight-Mozilla-MIT Hack Day, leading into the 2014 MIT-Knight Civic Media Conference. As usual, the hack day was loosely organized around the conference’s theme: this year, “The Open Internet and Everything After.” After 24 hours of hacking in the welcoming environment of the MIT Media Lab (spread over two days because we believe in sleeping), we ended up with six wonderful projects ranging from an ultra-practical redaction utility to a fake astroturf campaign against Net Neutrality.
In the next two days, we’ll be outlining the projects here on Source, introducing the people behind them, and offering hooks in for anyone who wants to jump in and help finish, test, and promote the fantastic work that came out of the event. And we’ll introduce the first two right now.
NPR’s Alyson Hurt pitched this project, a look at how Google Maps renders disputed territories differently depending on who’s looking (or rather, where they’re looking from). We asked her why:
What interested me was the idea that “facts” are different depending on where you are, that we’re not all seeing the same map. And that Google will redraw a country’s boundaries (at least for users from that country, if not all users) to suit that country’s demands—showing something as settled fact, rather than in dispute. The purpose of the project is to show that it’s happening, and point out notable differences.
Also, I thought it was really interesting that Google didn’t redraw boundaries for all countries with border disputes—for example, a few disputes in Africa remained as dotted lines in the various country views I checked. But disputes involving particularly China, India and Russia—big countries with a lot of power and business interest—were more likely to be “resolved” in the maps. (Google has noted in statements before, re: Crimea, that it follows the law in the countries it serves. And as far as Russia is concerned, Crimea belongs to them—a new “fact”—even if others still consider that a matter of dispute.)
Over at Quartz, David Yanofsky picked up on the project in its current form, and we asked project contributor Gus Wezerek about their hopes for its future:
I think we’d all like to add a lot of context to the piece. Right now the project stands as a survey of Google’s different worldviews. But unless you’re an expert on the conflicts in question, it can be hard to parse the significance of which boundary falls where. So annotations and friendlier descriptions for each disputed territory are probably at the top of the list. Also, the differences between some maps are as subtle as a dashed or solid boundary line. To that end, I think we’d like to tweak the UI to make it easier to pick out what changed. That, and a new title for the project to replace “Disputed Territories.” Even Wikipedia sells the topic better with its “Cartographic Aggression” page.
I think it stands well enough now as a snapshot in time. Long-term, if this is to be more of a tool, I think it’d be good to:
- Definitely add more context, as Gus said.
- Connect the screen capture script to the Google Spreadsheet we’re using as a database. (Currently, all the map URLs are hard-coded into the script.)
- Better automate adding disputed territories to the list and identifying their lat/lon.
- Allow users to enlarge or zoom in to a map image.
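For anyone curious what the spreadsheet hookup in that list might involve, here is a minimal sketch. It is not the team's script: it assumes the spreadsheet has been exported as CSV text, and the column names `territory` and `url`, the function name `load_map_targets`, and the example.com address are all hypothetical placeholders.

```python
import csv
import io


def load_map_targets(csv_text):
    """Parse CSV text into a list of {'territory', 'url'} dicts.

    Rows with an empty or missing URL are skipped. The column names
    are assumptions, not the project's actual spreadsheet schema.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    targets = []
    for row in reader:
        url = (row.get("url") or "").strip()
        if not url:
            continue  # nothing to capture for this row yet
        territory = (row.get("territory") or "").strip()
        targets.append({"territory": territory, "url": url})
    return targets


# Made-up sample data standing in for a spreadsheet export.
sample_csv = """territory,url
Crimea,https://example.com/maps/crimea
Placeholder,
"""

for target in load_map_targets(sample_csv):
    print(target["territory"], target["url"])
```

A screen capture script could then loop over the returned list instead of a hard-coded set of URLs, so adding a territory would only mean adding a spreadsheet row.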
Contribute to the project and check out the team’s initial documentation on GitHub.
Team: Waldo Jaquith, Gabriela Rodriguez, Manuel Aristarán, Ying Quan Tan, and Jonathan Stray.
Special note: Please don’t use SSN Redactor just yet. It’s not quite finished and not secure! See below.
This project, pitched by Waldo Jaquith, built the initial groundwork for a command-line redaction tool for Social Security Numbers in PDFs. Asked why they worked on a redaction tool at an open internet hack day, Jaquith and Gabriela Rodriguez wrote that “If we want the government and companies to open their data, we need to create tools to help them redact sensitive information. PDF is not the best format to release data, but it’s in widespread use, and this makes it easier for responsible publication of documents that bear sensitive materials.”
Contributor Jonathan Stray added:
It’s not very glamorous, but it’s a real problem! Quite often governments refuse to release documents because they contain personal information and it would be too expensive to remove it. Hopefully this tool will make it cheaper, and therefore lead to more open data.
I asked how close the tool was to being production-ready, and Stray responsibly warned against its use:
This tool is not ready for use! We write out a black box over SSNs but the information is still available in the PDF file on the layers underneath it. Do not use yet! Not safe! We will fix this soon.
…and then a more detailed breakdown of the project and its status:
We wrote code to read in a PDF, find strings of digits that look like SSNs, and draw black boxes over them. Fortunately we were able to get a test data set of PDFs that contain SSNs, and a laboriously human-created spreadsheet of which pages the numbers appear on. This allowed us to make an automated test system.
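To make the "find strings of digits that look like SSNs" step concrete, here is a minimal sketch of one way to do it. It is not the team's code: it assumes the page text has already been extracted into a plain string, and both the pattern and the function name `find_ssn_candidates` are illustrative assumptions. It only locates candidates and does nothing to the PDF itself.

```python
import re

# Assumed pattern: nine digits, optionally grouped 3-2-4 with hyphens or
# spaces, and not embedded in a longer run of digits.
SSN_CANDIDATE = re.compile(r"(?<!\d)\d{3}[- ]?\d{2}[- ]?\d{4}(?!\d)")


def find_ssn_candidates(text):
    """Return (start, end, matched_text) for each SSN-like run in text."""
    return [(m.start(), m.end(), m.group()) for m in SSN_CANDIDATE.finditer(text)]


# Made-up sample text; none of these are real numbers.
sample = "Applicant SSN 123-45-6789, case 987654321, phone 555-0100."
for start, end, value in find_ssn_candidates(sample):
    print(start, end, value)
```

A pattern this loose flags any nine-digit number, including case numbers and other identifiers, so false positives are expected by design.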
Results from our 24-hour hack: we detect 85% of SSNs. Of the numbers that are blacked out, 45% are not actually SSNs. (Our accuracy rate is almost certainly much higher than this, but we didn’t have access to decent OCR software when creating our test corpus of data.) We think we can improve both of these numbers, but we will always have to err on the side of assuming any 9-digit number is private information.
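Those two figures correspond to what is usually called recall (the share of real SSNs that got detected) and precision (the share of black boxes that cover a real SSN). The sketch below shows the arithmetic. The counts are hypothetical, chosen only so that they reproduce the quoted 85% and 45%; the team did not publish its raw numbers here, and `redaction_metrics` is an illustrative name.

```python
def redaction_metrics(true_ssns, flagged, correct):
    """Return (recall, precision) from raw counts.

    true_ssns: SSNs present in the hand-labeled test set
    flagged:   black boxes the tool drew
    correct:   black boxes that cover a real SSN
    """
    recall = correct / true_ssns
    precision = correct / flagged
    return recall, precision


# Hypothetical counts, not the project's actual test results.
recall, precision = redaction_metrics(true_ssns=220, flagged=340, correct=187)
print(f"SSNs detected: {recall:.0%}")
print(f"boxes that are not SSNs: {1 - precision:.0%}")
```

Treating every nine-digit number as private pushes recall up at the cost of precision, which is the trade-off described in the quote.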
The team is actively calling for help finishing the code, continuing to test it and improve its accuracy, and then getting it to people who need it, so please check out the repo and jump in.
This afternoon and tomorrow, we’ll intro four more projects, including the fake astroturfing campaign and a filesharing tool named after famous baking elves.