Upload a PDF, get back tabular CSV data. Poof!
As many developers and data reporters know, dealing with data tables in Adobe Acrobat PDF files is a pain in the rear (to put it lightly). Part of the problem is that PDF is not a data format so much as an electronic paper format. Another part: existing extraction tools, such as xpdf/Poppler’s pdftotext, aren’t designed for data tables and aren’t exactly human-friendly.
Jeremy B. Merrill recently wrote a first-hand account of some of the difficulties ProPublica encountered as they released a massive update to their Dollars for Docs interactive database. During this project, ProPublica used an internally-developed command-line utility named Farrago, which utilized computer vision techniques to detect and extract data from tables in PDF files.
At the same time, Knight-Mozilla Fellow Manuel Aristarán was working on the initial stages of a web app he called Tabula, drawing upon document analysis ideas found in academic papers by Tamir Hassan, Burcu Yildiz et al., and Ermelinda Oro, among others.
Today we’re pleased to announce the initial public release of Tabula, a collaboration born out of these previously separate projects. Tabula is free and available under the MIT open-source license. Tabula lets you upload a (text-based) PDF file into a simple web interface and magically pull tabular data into CSV format.
- You can play with a restricted live demo here to get an idea of what Tabula can do. (We’ll get to the details in a bit, but the processing steps are quite computationally expensive, so providing a live, publicly usable instance of the webapp is unfortunately beyond our means at this time.)
- Tabula is free software, so you can also set up your own instance. Instructions are available in the README.
How it works
The goal of the PDF format is to display exactly the same way across a wide range of platforms. The most relevant information that Tabula uses to recognize tables is the position (x and y coordinates) of each individual character on the page. We get that data by running the PDF through a JRuby script that drives the Apache PDFBox Java library to generate XML output similar to this:
<page number="1" position="absolute" top="0" left="0" height="792.0" width="612.0" rotation="90">
  <text top="10.60" left="341.89" width="7.21" height="4.89" fontsize="9.96" dir="90.0"><![CDATA[C]]></text>
  <text top="10.60" left="349.10" width="7.21" height="4.89" fontsize="9.96" dir="90.0"><![CDATA[R]]></text>
  <text top="10.60" left="356.30" width="6.66" height="4.89" fontsize="9.96" dir="90.0"><![CDATA[E]]></text>
  <text top="10.60" left="362.96" width="7.21" height="4.89" fontsize="9.96" dir="90.0"><![CDATA[D]]></text>
  ....
</page>
(We use this XML intermediate representation for historical reasons. Replacing it with something more efficient is on our to-do list.)
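To make the intermediate format concrete, here is a minimal Python sketch that reads XML shaped like the sample above into a list of character records. It is not Tabula's own code (which is a JRuby script driving PDFBox); the function name `parse_characters` and the record fields are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Two characters copied from the sample output above.
SAMPLE = """<page number="1" position="absolute" top="0" left="0" height="792.0" width="612.0" rotation="90">
  <text top="10.60" left="341.89" width="7.21" height="4.89" fontsize="9.96" dir="90.0"><![CDATA[C]]></text>
  <text top="10.60" left="349.10" width="7.21" height="4.89" fontsize="9.96" dir="90.0"><![CDATA[R]]></text>
</page>"""


def parse_characters(xml_text):
    """Return one dict per <text> element with its glyph and position."""
    page = ET.fromstring(xml_text)
    chars = []
    for node in page.iter("text"):
        chars.append({
            "text": node.text or "",
            "top": float(node.get("top")),
            "left": float(node.get("left")),
            "width": float(node.get("width")),
            "height": float(node.get("height")),
        })
    return chars


characters = parse_characters(SAMPLE)
```

Each record carries just the glyph and its bounding box, which is all the later steps need.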
At this point, we have a collection of single characters and their positions. Making matters worse, text in PDFs usually doesn’t contain space characters; instead, the first character of the next “word” is simply placed slightly farther to the right. As Jeremy points out in his fantastic blog post, “A reader can’t tell the difference, but to a computer the difference is a big pain.” We merge characters into words and add spaces, where necessary, using a set of heuristics.
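One plausible heuristic of this kind, sketched in Python: start a new word whenever the horizontal gap to the previous character is large relative to that character's width. The `gap_ratio` threshold and the function name are assumptions for illustration, not Tabula's actual rules, and the sketch assumes all characters sit on one text line.

```python
def merge_into_words(chars, gap_ratio=0.3):
    """Group the characters of one text line into words.

    chars: dicts with "text", "left" and "width".
    A new word starts when the gap to the previous character exceeds
    gap_ratio times that character's width (an assumed threshold).
    """
    words = []
    current = None
    for ch in sorted(chars, key=lambda c: c["left"]):
        right = ch["left"] + ch["width"]
        if current is not None:
            gap = ch["left"] - current["right"]
            if gap <= gap_ratio * current["last_width"]:
                # Close enough: extend the current word.
                current["text"] += ch["text"]
                current["right"] = right
                current["last_width"] = ch["width"]
                continue
            words.append(current)
        current = {"text": ch["text"], "left": ch["left"],
                   "right": right, "last_width": ch["width"]}
    if current is not None:
        words.append(current)
    return words


line = [
    {"text": "C", "left": 0.0, "width": 7.0},
    {"text": "R", "left": 7.0, "width": 7.0},
    {"text": "E", "left": 14.0, "width": 7.0},
    {"text": "U", "left": 30.0, "width": 7.0},  # 9-unit gap: new word
    {"text": "S", "left": 37.0, "width": 7.0},
]
words = merge_into_words(line)
```

A fixed ratio like this will misfire on tightly kerned or justified text, which is one reason a real implementation needs more than one rule.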
Next, we detect the boundaries of the table rows. If the table contains ruling lines to separate rows, we use their positions to generate the boundaries (top and bottom) of each row. The lines are detected with a computer vision technique called the Hough transform, as implemented in the OpenCV library.
A table with ruling lines
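Once the horizontal lines are known, turning them into row boundaries is the easy part. The Python sketch below assumes the line detection (the Hough transform step, not shown here) has already produced a list of y positions; the function name and the `tolerance` value are illustrative assumptions.

```python
def rows_from_ruling_lines(line_ys, tolerance=2.0):
    """Turn y positions of horizontal ruling lines into (top, bottom) rows."""
    merged = []
    for y in sorted(line_ys):
        # A line detector can report one thick line several times;
        # keep a single position per cluster of near-identical values.
        if not merged or y - merged[-1] > tolerance:
            merged.append(y)
    # Each pair of consecutive lines encloses one row.
    return list(zip(merged, merged[1:]))


row_bounds = rows_from_ruling_lines([100.0, 50.0, 50.5, 75.0])
```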
Tables without row or column graphic separators are also common. For this type of table, we cluster together the words that vertically overlap each other. The row boundaries are the bounding boxes of each detected cluster of words.
A table without graphic separators
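The clustering described above can be sketched as a single sweep over the words sorted by their top edge: a word either overlaps the current cluster and grows its bounding box, or starts a new cluster. This is a simplified Python illustration under assumed field names (`top`, `height`), not Tabula's implementation.

```python
def cluster_rows(words):
    """Group words whose vertical extents overlap.

    Returns one (top, bottom) bounding interval per detected row.
    """
    rows = []
    for w in sorted(words, key=lambda w: w["top"]):
        top, bottom = w["top"], w["top"] + w["height"]
        if rows and top < rows[-1][1]:
            # Overlaps the current cluster: grow its bounding box.
            rows[-1][1] = max(rows[-1][1], bottom)
        else:
            rows.append([top, bottom])
    return [tuple(r) for r in rows]


sample_words = [
    {"text": "Name", "top": 10.0, "height": 5.0},
    {"text": "Total", "top": 11.0, "height": 5.0},
    {"text": "Ann", "top": 30.0, "height": 5.0},
]
row_boxes = cluster_rows(sample_words)
```

Because any overlap merges two clusters, a wrapped cell whose second line touches the first would be swallowed into one tall row, or split into two rows if it doesn't, so this simple version is fragile on multi-line rows.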
An analogous procedure is then carried out for detecting column boundaries. Tabula clusters together words that overlap horizontally. The bounding boxes of those clusters are the column boundaries.
Now that we have recognized both rows and columns, we can output the data as a list of rows, ready to be displayed as an HTML table in the browser or to be rendered as CSV for download.
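As a rough Python sketch of this last step, assume we already have the words plus the row and column boundaries as (start, end) intervals. Each word is dropped into the cell that contains its center point, and the grid is written out with the standard `csv` module. The helper names and the center-point rule are assumptions made for illustration.

```python
import csv
import io


def find_index(bounds, center):
    """Return the index of the (start, end) interval containing center."""
    for i, (start, end) in enumerate(bounds):
        if start <= center < end:
            return i
    return None


def to_csv(words, rows, cols):
    """Place each word into its row/column cell and render the grid as CSV."""
    grid = [[[] for _ in cols] for _ in rows]
    # Sort so that words sharing a cell keep their reading order.
    for w in sorted(words, key=lambda w: (w["top"], w["left"])):
        r = find_index(rows, w["top"] + w["height"] / 2)
        c = find_index(cols, w["left"] + w["width"] / 2)
        if r is not None and c is not None:
            grid[r][c].append(w["text"])
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in grid:
        writer.writerow([" ".join(cell) for cell in row])
    return out.getvalue()


table_words = [
    {"text": "Name", "top": 0.0, "left": 0.0, "width": 40.0, "height": 10.0},
    {"text": "Total", "top": 0.0, "left": 100.0, "width": 40.0, "height": 10.0},
    {"text": "Ann", "top": 20.0, "left": 0.0, "width": 20.0, "height": 10.0},
    {"text": "Lee", "top": 20.0, "left": 25.0, "width": 20.0, "height": 10.0},
    {"text": "42", "top": 20.0, "left": 100.0, "width": 15.0, "height": 10.0},
]
csv_text = to_csv(table_words,
                  rows=[(0.0, 10.0), (20.0, 30.0)],
                  cols=[(0.0, 80.0), (100.0, 150.0)])
```

Words that fall outside every row or column are silently dropped here; a real tool would more likely want to report them.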
Current limitations
- Scanned PDFs: Tabula works on text-based PDFs only, so you’re still stuck with manual labor if you have scanned PDFs. Free OCR technology is not quite to the point where we’d trust automating it with many pages of data. For those files, Raleigh Public Record’s DocHive is worth a look.
- Multi-line rows: PDFs with multi-line rows (word wrapped text) are often mis-detected, particularly in tables without graphic row separators.
- Automatic table detection: We’re working on automating the detection of tables. For now, you’ll have to do a manual rectangular selection around the candidate tables.
- Tables with spanning cells: Handling these is no easy task. Tabula works best with tables that don’t contain rows or columns spanning several cells.
How you can help
Tabula is far from complete and our to-do list is growing faster than we can keep up. Here are some ways you can contribute:
- Set up your own server. Was it difficult? Fork the repo and help us update the documentation. Run into any issues along the way? Let us know about it.
- Table extraction is highly heuristic; it will work with some files and fail miserably with others. If you’re running your own server and find problematic PDFs, please file a bug report in our tracker and attach the offending document.
- We’re open to patches; just follow the usual Git workflow: fork, commit, push, and send a pull request. If you’re hacking on the table extractor, please make sure your patch includes a unit test for the case you’re fixing (see examples).
- Get in touch: we’re often in the #opennews room on irc.mozilla.org. Outside of the GitHub project, you can also follow Tabula and communicate with us on Twitter: @TabulaPDF.