Wrangling Datasets

There are many ways to analyze and parse a dataset, which is part of what makes data analysis exciting. But working with data can pose major challenges, whether we’re dealing with FOI denials or just trying to free data from (sadly ubiquitous) PDFs. Most of the time spent on data analysis goes to requesting, cleaning, and structuring data, and wrangling it into a format we can actually pipe into a spreadsheet, database, or graphic. The right tool or technique can save hours; these resources come recommended by journalists who use data in their reporting.


  1. Cleaner, Smarter Spreadsheets Start with Structure

    By Sandhya Kambhampati

    I recently wrote this article on how to structure data and why doing so is crucial, especially when building your own database. I describe some common issues with datasets that can be addressed beforehand, such as formats for numbers and header names.

  2. The Quartz Guide to Bad Data

    By Christopher Groskopf & Quartz GitHub Contributors, Quartz

    This guide lists a variety of common problems that we see when working with data, along with ways to solve them. It covers everything from dealing with text that’s been converted to numbers to using data where the units haven’t been specified. The guide is also available in Chinese, Japanese, Portuguese, and Spanish.

  3. The ProPublica Guide to Bulletproofing Your Data

    By ProPublica

    With every analysis you do, you’ll want to take good notes, document your steps, and have someone bulletproof your work. This guide highlights ways to bulletproof data work and explains how to build those checks into your workflow.

  4. Understanding Households and Relationships in Census Data

    By Anthony DeBarros

    When using census data, you’ll want to keep the caveats and changes to the surveys in mind, especially when you’re looking at data over time. This piece from Anthony DeBarros explains how to understand and properly use data on households and relationships. The piece walks you through the different tables that the Census Bureau collects and also links to, which is a great resource for sorting through the data.

  5. The Quartz Directory of Essential Data

    By Christopher Groskopf

    This Google Spreadsheet from Quartz lists datasets, mostly at the national and global level. It breaks down the source of the data and how granular the data is, and it links to some Quartz pieces where the data was used.

  6. Excel Magic

    By Mary Jo Webster

    There are many tools you can use for data analysis, but the gateway to getting you hooked on databases is Excel. This Excel cheat-sheet (PDF) from Mary Jo Webster is a comprehensive guide to many commonly used formulas. It covers lookups, sums, and string functions, and it includes a sample practice dataset.

  7. Tabula

    By Manuel Aristarán, Mike Tigas, and Jeremy B. Merrill

    PDFs are notoriously difficult to extract data from, and yet more often than not, data is provided to journalists in this format. Tabula is a tool to add to your data-cleaning tool belt. It lets you extract data from a PDF into an Excel spreadsheet or CSV.

  8. Csvkit

    By Christopher Groskopf

    If you work with data, you might need to convert Excel files or JSON to CSV. Csvkit is a command-line tool that makes it easy to convert files and merge datasets. It also has some handy functions for cleaning up CSVs and running some basic descriptive statistics.

  9. MuckRock

    If you’ve ever filed a FOIA request, you know that keeping track of your requests and staying on top of them can be tough without the right tool. MuckRock is a repository of requests that are currently filed, and it has a FOIA tracker so you can also track your progress.
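
To make the JSON-to-CSV conversion and dataset merging mentioned in the Csvkit entry more concrete, here is a minimal sketch using only Python's standard library. It is not csvkit's own code, and the function names (`json_records_to_csv`, `merge_on_key`) and sample fields are hypothetical, chosen only for illustration. It assumes the JSON is an array of flat objects.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)
    # Collect column names in first-seen order so no field is dropped
    # when some records are missing keys that others have.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def merge_on_key(left_rows, right_rows, key):
    """Left-join two lists of dict rows on a shared column name."""
    lookup = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        # Rows without a match on the right are kept unchanged.
        merged.append({**row, **lookup.get(row[key], {})})
    return merged


# Hypothetical sample data, invented for this example.
sample = '[{"city": "Chicago", "requests": 12}, {"city": "Austin", "status": "open"}]'
csv_text = json_records_to_csv(sample)
rows = list(csv.DictReader(io.StringIO(csv_text)))
states = [{"city": "Chicago", "state": "IL"}]
joined = merge_on_key(rows, states, "city")
```

Missing values become empty cells rather than errors, which keeps ragged JSON usable in a spreadsheet. A dedicated tool is still the better choice for large or messy files.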
