The Totally Incomplete Guide to Finding and Publishing Data
We’ve gathered up lots of great resources for finding the data you need to make your project amazing.
At Factful, we’re researching ways to make contemporary state-of-the-art data processing and storage tools more accessible to investigative reporters. One question driving our research was whether or not it made sense to create a large-scale data commons, a place where publicly useful sets of information could be stored, curated, and compared for the common good. Ultimately we decided that for us the answer is “no,” at least for now—there are plenty of incomplete or out-of-date data commons projects already, and building and maintaining a truly comprehensive project is a massive undertaking.
Along the way, we did compile a pretty comprehensive roundup of data repositories and commons projects that could be valuable tools for reporters, investigators, or anyone looking to increase accountability through publicly available information.
Data is Awesome
Data is an incredibly powerful reporting tool. It lets us scrutinize public spending and policy outcomes, challenge long-held conventional wisdom and participate more fully in public conversations. A decade of open data activism has left reporters and the general public with unprecedented access to public payrolls, traffic reports, police data, and much more. All of it allows us to hold policy makers accountable and understand the world in ways we couldn’t without access to the numbers.
Whether you’re a seasoned data journalist or brand new to thinking about data as a source in your reporting, there are exceptional places to find data that you may never have considered.
And if you’ve got a lot of interesting data that you’d like to share, there are some excellent tools for doing just that, none of which have the traction they deserve.
So who has data now, and how can you get your hands on it?
Start at the Source
This is not a list of every civic data repository, public data source, or research organization, but those are some of the richest data mines.
When I teach data reporting I always start with a workshop on finding data. We start by identifying a few beats that students are excited about—student loans, civil asset forfeiture, child welfare—and then we brainstorm potential sources for data on the subject.
The best way to start looking for data you need is almost always to ask yourself who could collect this data and look at where they might share it. Are there city, county, state, or national agencies that collect data? Do they publish it? If they don’t publish data, what happens when you ask for it? Sometimes all you have to do is ask; sometimes you have to file a more formal Freedom of Information request for the data.
Are there private research organizations or non-profits that keep data on the subject you’re researching?
In the data reporting class, we compile our findings in a list of tips and tricks called “where to find data.” That resource is not meant to be comprehensive. It’s meant to help you think about where to start looking for the data you need for your reporting. If you’re doing a good job, your first set of findings will leave you with additional questions. Those questions could send you back to the same source for more information, or they may lead you in a different direction. While there is no one centralized data commons to search, there is a rich patchwork of possibilities that will vary with each potential area of inquiry.
Once you’ve exhausted the direct approach, or you’re just interested in sparking some inspiration, there are a few more great places to look for data and ideas.
Newsroom data warehouses
Lots of newsrooms push cleaned data (and code) to GitHub, but there’s not a unified way to find it all. The Washington Post has released a collection of data on school shootings, police-involved shootings, and unsolved homicides, along with valuable context about how the data was collected and processed. BuzzFeed News maintains an indexed overview of all the data they’ve published to GitHub, as does 538. Here are a few more:
Arizona Central recently launched a data hub
Courier Journal (Louisville, KY)
Naples Daily News (Naples, FL)
New Jersey Advance publishes their data on data.world
News Press (Cape Coral, FL)
NPR Visuals publishes mostly code
Quartz includes data along with a ton of helpful code in their GitHub repository.
Tallahassee Democrat (Tallahassee, FL)
Chicago Data Collaborative includes data that newsrooms, academics and advocates have compiled to better understand criminal justice in Chicago.
Wireservice’s Lookup repository is a collection of very useful lookup tables for BLS, IPUMS and some Fed data. (Wireservice is a collaboration between a number of US newsroom developers and data reporters.)
Most of those newsroom data warehouses are on GitHub or data.world but there are definitely more options for publishing data! Aleph, CKAN, Datasette, Quilt, and Socrata are all described below and worth a look.
And then there are repositories!
In addition to the sources above, there are some far-reaching data warehouses and repositories and tools for publishing data that are pretty remarkable, as well as a few that kind of aren’t. This is an A-Z list.
With Aleph, OCCRP, the Sarajevo-based Organized Crime and Corruption Reporting Project, is building a unified index of data. They have tackled a few important questions, including managing access to data that they can’t advertise beyond a trusted network of reporters. Aleph is tightly focused on public accountability data and includes quite a few sources obtained through leaks. The data is well organized and includes a lot of accountability and anti-corruption data that isn’t available other places. Aleph is free and open source software so hosting your own instance is also an option.
Awesome Public Data is a great big list of public data sets, organized into broad topics. Anyone can propose data for addition by submitting a pull request. Awesome Public Data does a good job of continuously checking links and flagging broken links. And they point out canonical sources rather than trying to aggregate and store data. Unfortunately, there’s no descriptive information, so users can’t skim a list and have a sense of what kind of data is available at a particular source.
Registry of Open Data on AWS is a roundup of publicly available data stored on Amazon Web Services, with great usage examples. The AWS Open Data team vets submissions so the registry includes a range of actively maintained and clearly documented data. The collection is pretty random, however: Amazon Customer Reviews, IRS 990 Forms, soil chemistry, and data from Hubble Space Telescope instruments are all there, tagged but not organized in any particular structure.
Data Portals bills itself as a comprehensive worldwide index of data portals, which it is not. At a glance, a lot of smaller cities, like Berkeley and Oakland, CA, are not listed. Anyone can propose new portals, but the list definitely isn’t comprehensive yet.
Datasette is free and open source software for publishing data alongside a clean view of the data. They don’t maintain a commons, but if you’re looking for a good way to publish data and make it accessible for both skimming and analysis, Datasette might be a good fit.
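Datasette serves SQLite database files, so one common path to publishing is to load a cleaned CSV into SQLite first. Here is a minimal sketch using only Python’s standard library; the table name, column names, and sample rows are invented for illustration, and a real project would read from its own file.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a cleaned CSV file.
SAMPLE_CSV = """agency,year,requests
Parks,2017,120
Parks,2018,135
Transit,2018,410
"""


def load_csv_into_sqlite(csv_text, conn, table="records"):
    """Create `table` from the CSV header and insert every row as text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
    return header


# An in-memory database keeps the sketch self-contained;
# pass a file path such as "records.db" to create a file you can publish.
conn = sqlite3.connect(":memory:")
columns = load_csv_into_sqlite(SAMPLE_CSV, conn)
row_count = conn.execute('SELECT COUNT(*) FROM "records"').fetchone()[0]
print(columns, row_count)
```

Once the database is saved to a file, running `datasette records.db` should let you browse it locally before deciding where to host it.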
Data.world is a data collaboration platform. They encourage users to add data, which many have done, but they don’t enforce any particular policy about preserving provenance, and the site is cluttered with samples and tests. Data.world did identify a handful of sources and mirror them wholesale, e.g., Uniform Crime Reports or US EPA, and some newsrooms, including the Associated Press and NJ Advance, keep their public data collections on data.world. Unfortunately, there’s no hierarchy to the site, or structure of any sort. Anyone can add data, so there’s definitely some outright spam on the site. It’s an interesting place to search for data ideas, and maybe an interesting place to aggregate data you have worked with. But once you find something interesting, you’re going to want to head upstream to make sure you’ve got current, complete records.
Enigma Public is a relatively comprehensive collection of public and semi-public structured data. Data they consider “semi-public” includes information that they obtained via Freedom of Information request. Enigma has improved their provenance metadata significantly in recent years, and the data they provide is well documented but scattershot. Coverage of major US cities is much more complete than international data. Their list of governments includes a handful of countries outside the US, but in many cases only one or two data sets are actually available. A search for “Oakland 311” turns up no Oakland results but does surface NYC 311 data, last updated 8 months ago, as the top result. NYC’s actual 311 call data is updated daily, but an Enigma user wouldn’t necessarily know that more current data is available. Enigma can be a great resource but users will want to manually check upstream if they need or want the most current data.
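The freshness check described above is easy to make routine. This is a minimal sketch, assuming you have already looked up the last-updated date shown by the aggregator and by the original publisher; the function name, the one-week tolerance, and both dates are invented for illustration.

```python
from datetime import date, timedelta


def is_stale(mirror_updated, upstream_updated, tolerance=timedelta(days=7)):
    """Return True when the mirror lags the upstream source by more than `tolerance`."""
    return (upstream_updated - mirror_updated) > tolerance


# Invented dates: a mirror last refreshed months before the source.
mirror = date(2018, 5, 1)
upstream = date(2019, 1, 10)
print(is_stale(mirror, upstream))
```

If the check comes back true, work from the upstream copy rather than the mirror.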
Global Open Data Index, compiled by Open Knowledge International (OKFN), aims to provide a comprehensive snapshot of published government data. Their data is tightly organized by nation and topic, so OKFN can show you the state of public access to national legislative or land ownership data around the world, or public data in a handful of key topic areas for any one country. It appears that the index was last updated in 2015, but their sources can help you connect with current data sources. The Global Open Data Index is particularly useful to English-speaking researchers who need to find non-English-language data and may not be able to skim a foreign language government site in search of a specific data source.
Google’s Dataset Search tool launched in the fall of 2018. Google crawls the web for data sources that include schema.org microdata, and incorporates it into search results. The result is that the data they’re searching isn’t necessarily vetted, current, or accurate—Dataset Search results include a lot of data attributed to Kaggle (see that entry, below), which is all user submitted and often detached from its original source, making it difficult to find current data upstream. As more data publishers incorporate schema microdata, however, Dataset Search will get more comprehensive.
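For publishers, the markup Dataset Search looks for is a schema.org `Dataset` description embedded in the page. The sketch below builds one as JSON-LD, another encoding of the same schema.org vocabulary that sits alongside microdata; the dataset name, URL, and date are placeholders rather than a real dataset, and the helper function is hypothetical.

```python
import json


def dataset_jsonld(name, description, url, date_modified, license_url=None):
    """Return a schema.org Dataset description as a JSON-LD string."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "dateModified": date_modified,
    }
    if license_url:
        record["license"] = license_url
    return json.dumps(record, indent=2)


# Placeholder values for illustration only.
markup = dataset_jsonld(
    name="Example 311 Service Requests",
    description="Hypothetical service request records, updated daily.",
    url="https://example.org/data/311",
    date_modified="2019-01-15",
)

# The JSON-LD is embedded in the dataset's landing page inside a script tag.
print('<script type="application/ld+json">')
print(markup)
print("</script>")
```

Keeping `dateModified` accurate matters most here, since it is what tells a reader whether a copy found through search is current.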
Kaggle bills itself as a project-based data science site, but the site includes a commons of user-contributed data; there were 14,000 datasets when I last looked. Kaggle’s commons is an eclectic mashup of whatever users have supplied. They encourage users to supply provenance information and human-readable data dictionaries, but they don’t support automatic updates, so their data isn’t especially useful as source material. Their metadata includes the date data was added to Kaggle, but doesn’t indicate whether newer data might be available from the source, which it often is. Google recently acquired Kaggle, and (not surprisingly) Kaggle data shows up a lot in Google’s Dataset Search tool.
Open Policing Project at Stanford has aggregated police stop data from 31 US states and organized the data to facilitate comparisons across states. They’re aiming to collect, clean, collate and release data from all 50 US states and have plans and funding to keep the data up to date.
ProPublica publishes and sometimes sells some data. Data they obtained through formal public records requests (i.e. FOIA) is generally available free of charge on request; data they’ve cleaned or reconciled is available for purchase and licensing. Their collection is scattered and reflects their reporting rather than a concerted effort to create a unified index of data, but they have a lot of very interesting data and they do a very good job of being explicit about provenance and limitations.
Quilt is a Python package and business that facilitates Git-like data packaging that keeps provenance intact and supports tracking of any cleaning or transformation of data. Their commons includes any and all public data that users are storing there, so the quality and usefulness varies widely. Quilt is a super interesting option for reporters and newsrooms that want to publish data or share cleaned data, so if you’re looking around for a better-than-GitHub way to publish data you’ve cleaned or transformed, Quilt is worth checking out.
Socrata, like CKAN, builds software that facilitates sharing public data. Socrata doesn’t publish a list of instances, but many city, state, national and regional governments publish public data through a Socrata portal.
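Socrata portals typically expose each dataset through a JSON endpoint that accepts query parameters such as `$limit` and `$order`. Here is a rough sketch of building such a URL; the domain, dataset identifier, and column name are made up, and the endpoint pattern is an assumption to check against the API documentation of the portal you are using.

```python
from urllib.parse import urlencode


def socrata_query_url(domain, dataset_id, limit=100, order=None, where=None):
    """Build a query URL for a dataset on a Socrata-style portal."""
    params = {"$limit": limit}
    if order:
        params["$order"] = order
    if where:
        params["$where"] = where
    # urlencode percent-encodes the "$" prefix and any spaces in the values.
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"


url = socrata_query_url(
    "data.example.gov",          # hypothetical portal domain
    "abcd-1234",                 # hypothetical dataset identifier
    limit=5,
    order="created_date DESC",   # hypothetical column name
)
print(url)
```

Asking for a handful of the most recent rows like this is a quick way to see how current a portal’s data really is before downloading everything.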
Swirrl or PublishMyData is a UK-based linked data project with a lot of overlap with Socrata or CKAN. Swirrl primarily powers public data sites, e.g., Scottish Government. They include a cart feature that facilitates cross-comparisons within a given data store. Swirrl doesn’t publish a list of instances of their software, but quite a few local and national governments in the UK and Europe appear to use their software to publish public data.
Vigilant is a business that promises to track and compile public data and make it available to their customers in standardized formats. They don’t publish any data publicly.
Ally Jarmanning, a data reporter at WBUR, maintains a comprehensive guide to obtaining state court data.
Charles Ornstein at ProPublica spent ten years covering health care. His guide to covering opioids with data is required reading if you’re covering the opioid crisis, even if you don’t think you’re covering it with data.
Jeremy Singer-Vine’s newsletter, Data is Plural, isn’t strictly a research guide, but it’s great. Jeremy is the data editor at BuzzFeed News, and every week he sends out a roundup of a few interesting data sets. He also maintains a structured archive of recommendations that is a great place to look for inspiration, but probably not the best path if you already know what you want.
The Quartz Directory of Essential Data is a handy and fairly comprehensive spreadsheet of important data sources that Chris Groskopf maintained while he was at QZ.
Berkeley Advanced Media Institute’s roundup of US regulatory agencies is a great resource for looking into the data that federal and local regulatory agencies maintain, and the UCB Journalism School maintains a few more research guides at newmedia.report.
The Newmark Graduate School of Journalism at CUNY maintains a series of research guides including a guide to using Census data and a roundup of data resources, and their index of research databases is a great review of what is available if you have access to a library (you’ll need a library barcode to access the databases themselves, but the index is a handy starting place).
Dan Nguyen keeps a thorough roundup of data reporting course syllabi that are definitely worth rooting around in: most data reporting classes (and sometimes a few CS courses) include a lesson on finding data.
No data source roundup is complete without a loud reminder that data is only as good as the people who enter it. Before you rely on data for your reporting, you need to know who generated it and how the data you’re looking at got into the database.
Data is almost always entered by people. The fastest way to reduce the number of felony robberies in a single police precinct is to start classifying incidents as misdemeanors, and there’s good evidence that New York Police Department precincts did exactly that when the commissioner started rewarding precincts that got their serious crime rates down.
It isn’t clear why Baltimore County Police Department has more “unfounded” rape complaints than most departments nationwide, but BuzzFeed News found that many of those “unfounded” complaints were never really investigated.
Sometimes there are just quirks in the way data gets recorded. One report found that coroners don’t have solid standards for deciding whether to record a gun death as an accident or a homicide, and as a result, accidental homicides are split between the two categories, making it hard to track down reliable data.
Data is powerful, but it is never a substitute for picking up the phone and making some calls. If you’re just starting to think about where data fits in your reporting process, Samantha Sunne wrote an excellent introduction to the challenges and possible pitfalls of data journalism, and how you can avoid them.
So what do you do with all this data?
If you’re really new to data, knowing where to find it is only the beginning. You also need to get a handle on the tools you’ll use to clean, sort and understand the data.
NICAR trainings are a great way to get your bearings;
Source’s Guide to Working with Data includes a few tips for beginners;
Workbench tutorials are a great resource.
If you’ve already got a handle on the basics, Source’s regular roundups of Things You Made should inspire you to stretch your own wings a bit.
Amanda Hickman led BuzzFeed’s Open Lab for Journalism, Technology, and the Arts from its founding in 2015 until the lab wrapped up in 2017. She has taught reporting, data analysis and visualization, and audience engagement at graduate journalism programs at UC Berkeley, Columbia University, and the City University of New York, and was DocumentCloud’s founding program director. Amanda has a long history of collaborating with journalists, editors, and community organizers to design and create the tools they need to be more effective.