Sane Data Updates Are Harder than You Think
First in a three-part series by Adrian Holovaty about hairy data-parsing problems from a journalist’s perspective
This is the first in a series of three case studies about data-parsing problems from a journalist’s perspective. This will be meaty, this will be hairy, this will be firmly in the weeds.
We’re in the middle of an open-data renaissance. It’s easier than ever for somebody with basic tech skills to find a bunch of government data, explore it, combine it with other sources, and republish it. See, for instance, the City of Chicago Data Portal, which has hundreds of data sets available for immediate download.
But the simplicity can be deceptive. Sure, the mechanics of getting data are easy, but once you start working with it, you’ll likely face a variety of rather subtle problems revolving around data correctness, completeness, and freshness.
Here I’ll examine some of the most deceptively simple problems you may face, based on my eight years’ experience dealing with government data in a journalistic setting—most recently as founder of EveryBlock, and before that as creator of chicagocrime.org and web developer at washingtonpost.com. EveryBlock, which was shut down by its parent company NBC News in February 2013, was a site that gathered and sorted dozens of civic data sets geographically. It gave you a “news feed for your block”—a frequently updated feed of news and discussions relevant to your home address. In building this huge public-data-parsing machine, we dealt with many different data situations and problems, from a wide variety of sources.
My goal here is to raise your awareness of various problems that may not be immediately obvious and give you reasonable solutions. My first theme in this series is getting new or changed records.
The Problem: New and Changed Records
Say you’re running a site like EveryBlock, which collects a bunch of public records—daily, hourly or in real time—and it shows users the new stuff in their neighborhoods. Or, say you’re building a government-data mashup that displays crime data on Google Maps. Crimes happen daily, so you’ll need to update the site regularly to keep it relevant and useful.
In both of these cases, you’re relying on data from a third-party source that’s regularly updated. You need to be able to update your own application with the latest data, and, in order to do that, you need a way of detecting what’s new in that data source.
One way to do this, a Gordian Knot solution, would be to clear your local copy of the data each time you update it. This way, your local copy will always be exactly the same as your source, and you can simply delete it all and reimport every time you want to refresh your data. At face value, this is an enticing approach: it’s conceptually simple, it’s easy to implement and it avoids the hairy problems I’ve listed above (such as keeping data current). But it brings in new problems that might be a dealbreaker, depending on the data:
- It breaks URLs. If you’ve got permalinks for each record, and those URLs are generated based on a database ID or some other record-specific data, those links will likely break if you delete the data and reimport it. Remember, Cool URIs don’t change. It’s good practice not to break other people’s links to your site.
- It breaks relationships to other content. Say your site lets people comment on crimes. If I comment on a crime, then you delete the crime records and reimport them, you’ll have to reassociate the comment with the correct record in the reimported data. That’s doable but painful to manage, and it gets more painful the more content relationships you have—comments, favorites, page view counter data, etc.
- It’s inefficient. If it’s a huge data set, it might take way too long to do a complete reimport each time you update. And it’s inelegant to delete-and-reimport data that hasn’t actually changed.
- It might introduce database fragmentation. Depending on what datastore you’re using, you might want to limit unnecessary deletions to avoid fragmentation.
Solution: Detecting What’s New
We can avoid these problems by keeping track of which data we’ve already seen and which we haven’t seen. This way, we only update records that have been added or changed. In basic pseudocode, here’s what a data update would look like:
Get all records that are new or changed since your last import
For each record:
    Find the record in your old data
    If you didn't find it:
        Add it
    If you found it:
        Change whatever bits of it are different
For all your existing data:
    If it wasn't in the source data any longer:
        Delete it
The three tricky bits here are getting all new records, finding each record in your old data and changing attributes that are different. In this part of my series, I’ll focus on getting all new records. Specifically, given a regularly updated data set, how do you get only the stuff that’s new since the last time you updated? The answer is: it depends on the nature of the data. You’ll have to figure out what questions the data set is designed to answer and which ones you’ll have to find hacks for.
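The update loop above can be sketched in a few lines of Python. This is a minimal illustration, assuming your local copy is a dict keyed by some unique record ID and that the source hands you complete records; real storage would be a database, but the add/change/delete logic is the same:

```python
def sync(source_records, local_store):
    """Upsert records from the source, then delete local records that
    no longer appear in it. `local_store` is a dict keyed by whatever
    unique ID the data set provides (an assumption for this sketch)."""
    seen_ids = set()
    for record in source_records:
        seen_ids.add(record["id"])
        existing = local_store.get(record["id"])
        if existing is None:
            local_store[record["id"]] = dict(record)  # new record: add it
        elif existing != record:
            existing.update(record)  # changed record: update differing bits
    # Anything we didn't see in the source has been removed upstream.
    for record_id in list(local_store):
        if record_id not in seen_ids:
            del local_store[record_id]
```

Note the delete step only makes sense when `source_records` is the complete data set; if you’re fetching only new/changed records, skip it.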
The First Resort: “Last Updated” Columns
First, figure out whether the data set has a “last updated” column. If it does, then your work is simple—each time you update your data, just grab the records that have changed since the date/time of your last update. Pseudocode for this:
Get all records with "last updated" >= last_update_date
Set last_update_date to now
You’ll need to make sure to preserve the date/time of your updates so that you can do this query (“last_update_date” in this example). In case of long imports that take more than a few minutes, I recommend saving the time that you started the import, not the time that you finished it, in case your data source updates records during your own update.
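Here’s a small Python sketch of that watermark pattern. The `fetch_since` and `process` callables are stand-ins for whatever API call and add-or-update logic your application has; the key detail is saving the start time, not the finish time:

```python
from datetime import datetime, timezone

def import_changes(fetch_since, process, state):
    """Fetch records whose "last updated" is >= our saved watermark,
    then advance the watermark. `state` is any persistent key/value
    store (shown here as a dict for simplicity)."""
    # Record the start time *before* fetching, so records the source
    # updates mid-import are picked up again on the next run.
    started_at = datetime.now(timezone.utc)
    for record in fetch_since(state.get("last_update_date")):
        process(record)
    state["last_update_date"] = started_at
```

On the first run, `state.get("last_update_date")` is `None`, which your `fetch_since` should treat as “give me everything.”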
If your data set has this “last updated” field, you’re lucky, and you can skip all of the hacks in the rest of this article. Otherwise…
Other Date Columns
The next thing to try is to find all date columns and determine whether any of them can be used as a “proxy” to last updated, i.e., columns whose meaning suggests “this thing was completed/changed on such-and-such day.”
For example, say you’re dealing with your city’s building permits. In my experience, these data sets usually have some sort of “recorded date” or “received date” field—that is, a date representing when the permit was officially recorded or received by the city. That’s the kind of thing that (1) only happens once for each permit, (2) generally doesn’t change and (3) suggests finality or completion.
Or, if you’re dealing with crime data, you could use the “incident date” or “report date.” For restaurant inspections, you could use the “inspection date.” For city service requests, you could use the “request date.”
If you’ve found an appropriate date field, figure out whether you can retrieve only the records whose dates are recent. Regardless of whether your data is behind a nice API or being screen-scraped, you have two options: look for a way either to search by exact date or to sort all records by date. Pseudocode for the former:
/* date_import_1() */
For each date in the last X days from today:
    Get the records for that date
    For each record:
        Process the record
The value of “X” here depends on your data set and how frequently you update.
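In Python, the search-by-date approach might look like this sketch, where `fetch_for_date` stands in for your source’s by-date query or scrape:

```python
from datetime import date, timedelta

def import_recent(fetch_for_date, process, days_back=7):
    """Re-fetch records for each of the last `days_back` days.
    `days_back` ("X" above) depends on your data set and how
    frequently you update."""
    today = date.today()
    for offset in range(days_back):
        day = today - timedelta(days=offset)
        for record in fetch_for_date(day):
            process(record)
```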
Pseudocode for the latter:
/* date_import_2() */
Get all records, ordered descending by the date field (newest to oldest)
For each record:
    If you've already seen this record, break
    Process the record
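The sort-by-date approach is a simple walk-until-familiar loop. In this sketch, `already_seen` is a stand-in for a lookup against your local store:

```python
def import_new(records_newest_first, already_seen, process):
    """Walk a newest-first stream of records and stop at the first
    one we've already imported; everything before it is new."""
    for record in records_newest_first:
        if already_seen(record):
            break  # everything older than this is already in our store
        process(record)
```

One caveat: if a record can *change* without its date changing, this loop won’t catch the change; it only catches new records.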
Incremental IDs
If your data set doesn’t have an appropriate date field, look for fields in the data whose values are incremental: IDs, report numbers, case numbers, permit numbers…any number that appears to increment each time there’s a new record. Then you can write an importer script that gets records incrementally, adding 1 to the previous record’s ID, onward and upward until you come up blank X times in a row. Some tips here before you start writing code:
- Spot check to see how sparse the ID values are. Do they generally increment by 1, such that if you choose a random reasonable value there’s a record there? Or are they sparse (with a lot of missing records between incremental IDs)? If it’s too sparse, it might not be worth using the incremental approach.
- Don’t get fooled by IDs with letters in them; those still might be incremental. For example, building permit IDs “BP123”, “BP124”, “BP125” or inspection IDs “523424X” and “523425X”.
- Sometimes incremental IDs have extra data in them such as the current year. For example, restaurant inspection IDs “2012-434”, “2012-435”. In this case, write your importer in a way that takes the year into account.
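A helper for incrementing IDs like these might look like the following sketch. It increments the last run of digits in the ID and leaves any prefix or suffix intact; real ID schemes vary, so inspect yours before trusting this:

```python
import re

def next_id(record_id):
    """Increment the numeric part of an ID like 'BP123', '523424X'
    or '2012-434', preserving prefixes, suffixes and zero-padding."""
    # Find the last run of digits (the part that actually increments).
    match = re.search(r"(\d+)(?=\D*$)", record_id)
    if match is None:
        raise ValueError(f"No numeric part in {record_id!r}")
    digits = match.group(1)
    incremented = str(int(digits) + 1).zfill(len(digits))
    return record_id[:match.start(1)] + incremented + record_id[match.end(1):]
```

For year-prefixed IDs like “2012-434”, you’d also want logic that resets the counter when the year rolls over; that’s source-specific and omitted here.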
If your data source lets you query it, ordering by this ID field, the pseudocode would look like this:
/* increment_id_1() */
Get all records, ordered descending by the ID field
For each record:
    If you've already seen this record, break
    Process the record
If your data source doesn’t let you query by this ID field, see whether there’s a way to retrieve records directly by ID (e.g., “Give me the data for building permit ID 1234.”). If that’s possible, you can use a “shooting in the dark” technique: just grab a record with a known ID, then increment the ID by one and see whether there’s data there.
Note that, using this technique, it’s important to account for possible “holes” in the data; if you increment by one and there’s no record found with that ID, keep trying to increment, as the IDs might not be 100% contiguous. Just make sure to put a failsafe in your code, so that, for example, your program stops running when it comes up blank 100 IDs in a row (experiment with values other than 100). Pseudocode for this:
/* increment_id_2() */
Assign current_id = (some known record's ID, perhaps the last one you retrieved last time)
Assign num_holes_seen = 0
While num_holes_seen is less than 100:
    Try getting the record for current_id
    If it exists:
        Process it
        Reset num_holes_seen to 0
    Else:
        Increment num_holes_seen by 1
    Increment current_id by 1
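In Python, the hole-tolerant version might look like this. `fetch_by_id` stands in for your per-ID API call or scrape, returning `None` when there’s no record (an assumption of this sketch); this also assumes numeric IDs, so combine it with something like an ID-incrementing helper if yours have prefixes:

```python
def import_by_id(fetch_by_id, process, start_id, max_holes=100):
    """Walk IDs upward from a known record, tolerating gaps, until
    `max_holes` consecutive IDs come up empty (the failsafe)."""
    current_id = start_id
    num_holes_seen = 0
    while num_holes_seen < max_holes:
        record = fetch_by_id(current_id)
        if record is not None:
            process(record)
            num_holes_seen = 0  # found one, so reset the failsafe counter
        else:
            num_holes_seen += 1
        current_id += 1
    return current_id  # a useful starting point for a follow-up sanity check
```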
Warning: I’ve gotten bitten while using this technique in cases when a data set’s IDs have huge jumps. For example, the pseudocode above will fail if the data set’s IDs jump by more than 100—e.g., if there’s a record with ID 550, no records with IDs 551-700, then a record with ID 701. To get around this, I recommend occasionally searching for ID values significantly higher than your last known highest ID value; it’s imperfect (there’s a reason I call this “shooting in the dark”!) but better than not doing it. For example, you could run this pseudocode after the above pseudocode increment_id_2():
/* increment_id_sanity_check() */
Assign tries = 10
Assign current_try = 1
While current_try is less than tries:
    Increment current_id by 500
    Call increment_id_2()
    If any records were found:
        Alert the developer that there's a big hole in the data
        Break
    Increment current_try by 1
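A simplified Python version of that sanity check might probe a handful of IDs well past the last known one. The step of 500 and ten tries here mirror the pseudocode but are worth tuning to your data set:

```python
def probe_for_jump(fetch_by_id, last_id, step=500, tries=10):
    """Probe IDs far beyond the last known one, in case the data
    set's IDs made a jump bigger than the importer's failsafe.
    Returns the first ID found past the jump, or None."""
    for attempt in range(1, tries + 1):
        probe_id = last_id + step * attempt
        if fetch_by_id(probe_id) is not None:
            # Big hole found: alert the developer and resume
            # the incremental import from here.
            return probe_id
    return None
```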
You could also write some code that automatically alerts you if your importer hasn’t found any new records in X days (which is a good idea regardless).
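The no-new-records alert can be as simple as comparing a saved timestamp against a threshold, as in this sketch; wire the `True` case into whatever alerting you already use:

```python
from datetime import datetime, timedelta, timezone

def importer_is_stale(last_new_record_at, max_days=3):
    """True if the importer hasn't found a new record in `max_days`,
    which usually means the source changed or the scraper broke."""
    age = datetime.now(timezone.utc) - last_new_record_at
    return age > timedelta(days=max_days)
```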
I’ve had a surprising amount of luck using this “shooting in the dark” strategy for several data sets, such as Chicago tows, Chicago parking tickets and various cities’ press releases.
The Last Resort: All Data
Failing these options, the last resort is to grab the entire data set. Hopefully it doesn’t take too long to download/scrape—otherwise it might not be worth doing.
At this point, you might also want to contact the data source’s maintainer/owner to see whether they can add a “last updated” field, another date field or an incremental ID field (in that order of preference). You never know what they might do for you. Nice people exist in this world.
Changes in Old Records
So far, the advice here has focused on getting new records—records that have been created since your last import. But, depending on the nature of the data set, you might need to handle records that have changed. In my experience, data changes happen more often than expected. Government agencies fix typos, add year-old records that had somehow not made it into the system previously, remove duplicates, or remove sensitive or incorrect data.
If you’re building something like EveryBlock that’s focused on what’s new, this may not be a big deal—you can just ignore additions or changes to old data. But if you’re building something that displays aggregate stats (“there were 526 burglaries in 2012”) or lets people search for old data, it’s important to make sure older data is up to date.
You can’t go wrong with the Gordian Knot solution above, if you need 100% accurate data all the time. Otherwise, you’ll need to make a tradeoff between frequency/efficiency of updates and data completeness. What’s more important for your particular application—displaying the most recent stuff, frequently updated? Or giving people accurate historical data? If you’re building something like EveryBlock, you’ll optimize for the former. If you’re building, say, a “show me the history of this building” app, you should optimize for the latter.
If you want to have it both ways, you can try a certain technique that has served me well: always update the newest few records and always update the oldest few records.
For example, my former project chicagocrime.org, an old-school crime mashup, got its data by screen-scraping the Chicago Police Department’s site. That site only included data for the last 90 days’ worth of crimes, and it was a pretty involved/slow scrape. My priorities were (1) update the newest data as soon as it’s available, (2) don’t crash the CPD’s site or get banned for aggressive scraping, (3) update older data however possible and (4) be upfront with people about what data is “fresh” and what data might be outdated.
Given these priorities, here’s what I decided to do. I scraped the site daily in two parts. First, I got all of the most recent crimes (if I remember correctly, it was the last 14 days’ worth). Then, I got the oldest 14 days’ worth of crimes in the CPD’s system—the ones reported from 76 to 90 days ago. This way, I always had fresh data for what people cared about the most—the most recent stuff—and I “cleaned up” any possible additions or changes in the older data at the last possible moment before the data was removed from the CPD’s site.
As a result, the data for 1-14 days ago and the data for 76-90 days ago was always pretty up-to-date, while the data from 15-75 days ago might have been a little stale and the data from 91+ days ago had absolutely no guarantee of being correct (because it had “fallen off” the official CPD site).
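Computing those two scrape windows is simple date arithmetic. This sketch returns the newest window and the oldest window before records fall off a source that only keeps a fixed horizon of data (the 14-day window and 90-day horizon match the chicagocrime.org example; exact day-counting conventions will vary by source):

```python
from datetime import date, timedelta

def scrape_windows(today=None, window=14, horizon=90):
    """Return two inclusive (start, end) date ranges to scrape:
    the newest `window` days, and the oldest `window` days before
    records drop off a source that keeps only `horizon` days."""
    today = today or date.today()
    newest = (today - timedelta(days=window - 1), today)
    oldest = (today - timedelta(days=horizon - 1),
              today - timedelta(days=horizon - window))
    return newest, oldest
```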
Being Upfront with Users
Whatever you end up doing, you should be clear with your site’s users about the data limitations and gotchas. In the above chicagocrime.org example, I was upfront with users about the freshness, or lack thereof: for any crime older than 90 days, I displayed a note saying it might be incorrect, because the record was no longer on the CPD site for verification. (In this case, they could go to the police department in person to find the most up-to-date information about that crime.) Also, if your app does aggregation or attempts to derive trends or patterns, things get riskier, and you should be particularly open about the shortcomings of your methods.
Any application that repurposes data from another source has an obligation to explain how it gets the data. I highly urge you to be upfront with your users about which parts of the data might be out of date, how often it’s updated, which bits of the data are updated (only recent records? everything?) and any other peculiarities about your process. The more transparent you are about it, the better.
Until Next Time!
That’s it for part 1 of the hairy data-parsing problems series. In the next part, I’ll cover the second hairy part: given a record from your data source, how do you reconcile it with your old data?