All About Transcribable
Al Shaw breaks down ProPublica’s latest open source tool
Yesterday, ProPublica released Transcribable, a new open source tool that makes orderly crowdsourced transcription available to any organization that uses Ruby on Rails. Transcribable is a fully documented Rails plugin based on the code that underlies ProPublica’s Free the Files project, which we’ve discussed elsewhere with ProPublica’s Al Shaw. Shaw introduced the project to the public in a post on ProPublica’s Nerd Blog:
Since we wrapped up our Free the Files project after last year’s U.S. election, many people and organizations have asked us how they could build their own web applications like Free the Files to crowdsource their caches of documents. […] Transcribable allows you to drop a RubyGem into your Rails app, and instantly add “transcribability” to any attribute on a given model.
Yesterday afternoon, Shaw answered our questions about the project’s background and about ProPublica’s emphasis on “stealable” code.
Q&A with Al Shaw
Q. Transcribable looks like an enormous time-saver for organizations that need crowd-sourced file transcription capabilities. What are the major things teams won’t have to build themselves if they use Transcribable?
Transcribable will automatically build out the entire system for collecting transcriptions, assigning them to users and verifying them. Organizations will still have to build out the part of the app that will let people explore the results of the collected data, because those apps will, of course, be different based on the kind of data collected. After you specify which bits of your documents you’d like users to transcribe, Transcribable will actually generate much of the code for you based on those attributes, including a customized collection form. There’s also a script that will collect all the documents and stick them in your database, if they’re organized into a DocumentCloud project.
Q. How difficult is Transcribable to set up, and what are the system prerequisites for using it?
The biggest roadblock is: you need to know Ruby on Rails. Transcribable assumes you already know Ruby and Rails. But, once you’ve got a fresh Rails app, setting up Transcribable is extremely simple. The only work you’ll have to do is customizing the look and feel of the form, and tweaking the algorithms for assigning files and verifying responses. If you want to, say, weight the way files are assigned (we did this based on how “swingy” a TV market is), you can override the defaults.
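To make the idea of weighted assignment concrete, here is a minimal Ruby sketch of picking the next file with probability proportional to a weight. It is an illustration only and does not use Transcribable's API: the Filing struct, its weight field, and the pick_weighted method are hypothetical names invented for this example, and an app would plug its own logic into whatever override points the gem documents.

```ruby
# Hypothetical illustration; none of these names come from Transcribable.
# A filing with a numeric weight, e.g. higher for "swingier" TV markets.
Filing = Struct.new(:id, :weight)

# Pick one filing at random, with probability proportional to its weight.
def pick_weighted(filings, rng: Random.new)
  total = filings.sum(&:weight)
  raise ArgumentError, "no assignable filings" unless total.positive?

  # Choose a point in [0, total) and walk the list until we pass it.
  target = rng.rand * total
  filings.each do |filing|
    target -= filing.weight
    return filing if target < 0
  end

  # Guard against floating-point rounding: fall back to the last
  # filing that has a positive weight.
  filings.reverse.find { |filing| filing.weight.positive? }
end
```

With weights of 1.0 and 3.0, the second filing would be handed out roughly three times as often as the first. A filing with a weight of zero is never assigned.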
Q. This looks like more than just a gift of open-source code. There are design choices embedded in Transcribable (“casino-driven design,” etc.) that your team refined over time, no? That seems like a benefit all on its own.
Our concept of casino-driven design is something we worked on a lot in both Free the Files and Message Machine, our crowdsourced app for analyzing political campaign emails. In many ways, it’s the crux of Free the Files’ success, so it’s natural that we’d want to package it with Transcribable. As it stands, the transcription page view is the only bit of design packaged with the gem. Everything else is up to the organization to style. One of the coolest parts of Transcribable is that it automatically generates the “casino” page based specifically on the attributes you want transcribed, and if you add more fields you want transcribed later on, you can also regenerate it.
Q. What are the major differences between Transcribable and the codebase you used for Free the Files?
The biggest difference is that Transcribable has no implementation-specific code. Free the Files has tons of code tailored specifically to gathering and interpreting FCC data: geographical queries based on TV station transmitter ranges, FCC scraping scripts, a weighting algorithm based on swing states, verification based on specific attributes, the ability to skip files because they’re not invoices, the list goes on and on. We extracted the most important methods needed for the assignment and verification processes, and simplified the controllers and views for Transcribable. Free the Files also has a complicated login system with Facebook support. We ripped that out and went with a very simple cookie-based system for assigning users. Organizations that want to use Transcribable for complicated projects will probably want to implement their own login systems, but we didn’t want that to be a decision Transcribable makes for you, so we went with the absolute simplest system that works out of the box.
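The interview does not spell out how verification works, so the following Ruby sketch shows just one common approach, as an assumption: treat a field as verified once enough independent transcriptions agree on its value. The verified_value method, its threshold parameter, and the normalization step are all hypothetical and are not taken from Transcribable's code.

```ruby
# Hypothetical illustration; not Transcribable's verification code.
# Returns the agreed-upon value when at least `threshold` submissions
# match after light normalization, or nil when there is no agreement yet.
def verified_value(submissions, threshold: 2)
  counts = Hash.new(0)
  submissions.each do |submission|
    # Normalize so that "Acme Corp" and " acme corp " count as a match.
    normalized = submission.to_s.strip.downcase
    counts[normalized] += 1 unless normalized.empty?
  end

  value, count = counts.max_by { |_value, n| n }
  count && count >= threshold ? value : nil
end
```

A real project would likely tune both the threshold and the normalization to its data; dollar amounts and names, for example, may need different matching rules.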
Q. We all know how much time it takes to tie up loose ends, create documentation, and open-source code in a way that’s genuinely useful to others. How does your team manage to set aside resources for efforts like this? And why is it important enough that you actually take the time and energy to do it?
Since we released Free the Files, a lot of people and organizations have come to us because they’ve wanted to do similar things. We’ve invited people in, and sheepishly shown them the rats’ nest of code hoping they may be able to grab bits and pieces out, but that code was usually too specific for people to do anything with, so we set out to extract the useful bits. It takes time to actually release something in a way people will find useful, which is why we don’t open source our full news applications, only abstract components or tools we build along the way. Just like our stories are “stealable,” giving away our code has always been a big part of our team’s mission, so we try to make the time to do it right when we can. We’re heartened that projects like StateFace, TimelineSetter, and TableSetter have been used by dozens of other organizations, so that makes it worth it for us.
Q. Anything else you’d like to share about Transcribable that isn’t in the official announcement or the repo?
The version of Transcribable we released today is really just a first draft. At the end of my post about casino-driven design, I wrote about a few things we want to do in the future in the area of crowdsourcing: stuff like computer vision, different ways of assigning documents and tasks, asking users to draw boxes around interesting bits of documents and using OCR to parse the data out of them. These are all things that could be added to Transcribable that all organizations could benefit from. Of course, feature requests and pull requests are more than welcome over on GitHub.