
The New York Times’ Election Results Loader

Jacob Harris explains how it was made, tuned and tested


Ohio general election reporting interface

This is an article about a very specific but important part—the election results loader—of the New York Times’ elections coverage online. The initial election results loader was written mostly by Ben Koski and me in the run-up to the general election of 2008. It also served up the election results for the midterms of 2010. This past year, I modified it further to support multiple election dates and some of the unique quirks of each state’s primaries. In the homestretch, I also got some much-needed help from Brian Hamman, Jacqui Maher, Michael Strickland, and Derek Willis. Of course, there are so many more people who worked on the Times’ election site, but their work deserves a better chronicler than I. (Besides, I’m sure all of you will find election loading as intensely riveting as I do.)

The AP Service

Like every other news organization on the planet, we get our election results data from the Associated Press. Google may have had some success providing results from some early caucuses, but nobody matches the depth of AP’s operation. There are roughly two tiers of customers for the AP Election Results: the TV networks who pay lavishly for the timeliest results, and the other news organizations like newspapers that get a slightly time-delayed version via FTP.

For us, the AP provides an FTP server where they update election files on a regular basis. There are three files that specify the election results in a state:

  • The Races file specifies basic race metadata (e.g., this is the Ohio House 7th District Republican Primary) and records how many precincts are reporting at any given point in the night. This file includes both statewide and county-level results (e.g., the Presidential results for Ohio and the Presidential results for Cuyahoga County, Ohio).
  • The Candidates file provides a list of AP candidates and their identifiers.
  • The Results file provides the vote totals for each candidate in a race.

The Candidates file is largely fixed by the time an election occurs in a given state, so it only needs to be loaded once at the beginning of a cycle. Still, this means loading 102 files on a general election night (two for every state and the District of Columbia) containing some 50,000 state and county races. We wanted to run a load roughly once a minute, and just grabbing all those files from the FTP server takes approximately 30 seconds… so clearly, we needed to be fast with everything else. This constraint shaped a lot of our design, but the resulting code enabled several powerful features down the road.

It all starts with race changes.

Track What Changes

Loading…

Once loaded, the AP files provide a complete representation of the election at that particular moment in time. But usually 10% or fewer of those races change between one load and the next. We were concerned that it would unduly stress the database to reload rows of unchanged data within a transaction during each load. So, we decided to figure out which races change on any given load, and then only update those. We do this by creating parallel staging and production tables for races and results. Staging holds the races loaded for that day. Production includes races for that day and prior elections. We also created a special table (race_changes) for tracking race changes. Our loading process then runs like this:

  • Create a new Load record in the DB with a unique autoincrement ID
  • Clear all the staging tables
  • Load AP races/results into the staging tables
  • Run the ChangeDetector (a series of SQL queries that look for cases where a race has changed)
  • For each changed race, add a row to race_changes with the race_id and the load_id
  • Update production races/results by copying over from staging all races in the race_changes table for that load
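
The cycle above can be sketched in a few lines. This is a rough illustration, not the Times' actual code (which ran on Rails): it uses Python's built-in sqlite3 module so the SQL can run as written, it tracks only precincts reporting, and every table and column name other than race_changes, race_id and load_id is an assumption. `INSERT OR REPLACE` is SQLite syntax; other databases spell the copy step differently.

```python
import sqlite3

# Hypothetical, heavily simplified schema: one staging table, one production table.
SCHEMA = """
CREATE TABLE loads (id INTEGER PRIMARY KEY AUTOINCREMENT,
                    created_at TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TABLE staging_races (race_id TEXT PRIMARY KEY, precincts_reporting INTEGER);
CREATE TABLE races         (race_id TEXT PRIMARY KEY, precincts_reporting INTEGER);
CREATE TABLE race_changes  (load_id INTEGER, race_id TEXT);
"""

def run_load(conn, ap_rows):
    """Run one load cycle. ap_rows is a list of (race_id, precincts_reporting)."""
    cur = conn.cursor()
    # Create a new Load record with an autoincrement ID.
    cur.execute("INSERT INTO loads DEFAULT VALUES")
    load_id = cur.lastrowid
    # Clear staging, then load the AP data into it.
    cur.execute("DELETE FROM staging_races")
    cur.executemany("INSERT INTO staging_races VALUES (?, ?)", ap_rows)
    # Detect changed races and record one race_changes row per changed race.
    cur.execute("""
        INSERT INTO race_changes (load_id, race_id)
        SELECT ?, s.race_id
        FROM staging_races s
        LEFT JOIN races p ON p.race_id = s.race_id
        WHERE p.race_id IS NULL
           OR p.precincts_reporting <> s.precincts_reporting
    """, (load_id,))
    # Copy only the changed races from staging to production.
    cur.execute("""
        INSERT OR REPLACE INTO races (race_id, precincts_reporting)
        SELECT s.race_id, s.precincts_reporting
        FROM staging_races s
        WHERE s.race_id IN (SELECT race_id FROM race_changes WHERE load_id = ?)
    """, (load_id,))
    conn.commit()
    changed = [row[0] for row in cur.execute(
        "SELECT race_id FROM race_changes WHERE load_id = ? ORDER BY race_id",
        (load_id,))]
    return load_id, changed
```

On a first load every race counts as changed; on later loads only the races whose staging row differs from production are touched.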

What marks a race change? Generally, there are not a lot of situations we need to check for. A race or result that is not found in the production table is an obvious example of a race change (i.e., the initial load). But once the loader has been running for a while, a change is usually triggered by one of the following situations:

  • The number of precincts reporting has changed
  • The number of votes for a candidate has changed
  • A candidate has been declared the winner
  • A call for a candidate has been retracted

There are a few other special cases in there (for instance, when we manually call races at the Times, we pick that up as a race change even if none of the other change conditions were triggered), but these change conditions are pretty simple to specify and really fast to check for in SQL.
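
To give a feel for how compact these checks are in SQL, here is a sketch of a single query covering the four conditions listed above plus the "not yet in production" case. It is an illustration under assumed table and column names (again run through Python's sqlite3 for convenience), not the actual ChangeDetector, and it leaves out the manual-call special case.

```python
import sqlite3

# Hypothetical schema: production and staging copies of races and results.
SCHEMA = """
CREATE TABLE races           (race_id TEXT PRIMARY KEY, precincts_reporting INTEGER);
CREATE TABLE staging_races   (race_id TEXT PRIMARY KEY, precincts_reporting INTEGER);
CREATE TABLE results         (race_id TEXT, candidate_id TEXT, votes INTEGER, winner TEXT);
CREATE TABLE staging_results (race_id TEXT, candidate_id TEXT, votes INTEGER, winner TEXT);
"""

CHANGED_RACES = """
SELECT s.race_id
FROM staging_races s
LEFT JOIN races p ON p.race_id = s.race_id
WHERE p.race_id IS NULL                                  -- race not in production yet
   OR p.precincts_reporting <> s.precincts_reporting     -- precincts reporting changed
UNION
SELECT s.race_id
FROM staging_results s
LEFT JOIN results p
       ON p.race_id = s.race_id AND p.candidate_id = s.candidate_id
WHERE p.race_id IS NULL                                  -- result not in production yet
   OR p.votes <> s.votes                                 -- vote total changed
   OR COALESCE(p.winner, '') <> COALESCE(s.winner, '')   -- call made or retracted
ORDER BY 1
"""

def changed_races(conn):
    """Return the sorted IDs of races whose staging data differs from production."""
    return [row[0] for row in conn.execute(CHANGED_RACES)]
```

The UNION removes duplicates, so a race that trips several conditions at once is still reported only once.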

It’s a little more complex than just loading the AP race data directly into production tables. But it gives us a real speed boost when not much has changed. And it turned out that it made some other cool features simple to implement.

Trade Abstractions For Speed

Object-Relational Mappers are a standard component of most web frameworks these days. They simplify the process of working with databases by mapping SQL records into objects that can be manipulated directly in code. Our team at the New York Times uses Rails for our code, which includes the ActiveRecord ORM layer. This makes it simple for us to load state races when it’s time to render a page or respond to an API request. But that abstraction adds performance costs that we sometimes had to bypass in the name of speed. For instance, we know a race has been called by the AP when one of its corresponding results has its winner field set to ‘X’. While this is easier to code in Rails, the performance cost of marshaling objects makes an ActiveRecord implementation glacial compared to a reasonably complex SQL UPDATE statement. We let SQL do what it does best whenever it makes sense to use it.
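
As an illustration of the kind of statement meant here, the sketch below flags every called race in one set-based UPDATE instead of loading each race and its results as objects. The winner = 'X' convention comes from the AP data described above; the ap_called column and the table layout are assumptions, and Python's sqlite3 is used only to make the SQL runnable.

```python
import sqlite3

# Hypothetical schema for illustration.
SCHEMA = """
CREATE TABLE races   (race_id TEXT PRIMARY KEY, ap_called INTEGER DEFAULT 0);
CREATE TABLE results (race_id TEXT, candidate_id TEXT, winner TEXT);
"""

def mark_called_races(conn):
    """Flag every uncalled race that has a result with winner = 'X'.

    Returns the number of races newly flagged.
    """
    cur = conn.execute("""
        UPDATE races
        SET ap_called = 1
        WHERE ap_called = 0
          AND EXISTS (SELECT 1
                      FROM results r
                      WHERE r.race_id = races.race_id
                        AND r.winner = 'X')
    """)
    conn.commit()
    return cur.rowcount
```

The database does all the matching in one pass, and no per-row objects are ever built in application code.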

Thus, almost every subroutine of our loading process is written in SQL. However, SQL is notoriously obtuse once you add a few JOINs to a query. How do you avoid the almost-certain path to madness? You write a lot of unit tests you can run regularly to ensure the code is correct. We actually wrote the tests first as we developed parts of the loader and added tests with any new features. There are more than 500 tests for the loading system. There could probably be more.

The Joy of Exaptations

Exaptation is a term from evolutionary biology that describes when a trait evolved for one purpose is co-opted for something else. The feather is a classic example of this; dinosaurs likely first evolved them for insulation only to later use them for stability and eventually flight. In a similar fashion, we found there were several delightful features that were developed as exaptations of our change-driven loading cycle.

Let’s start with the staging tables. Loading the AP data into staging first gives us an easy place to do basic sanity checking of the data and easy error recovery if a single state fails to load. More importantly, it allows us to forcibly zero out races before copying them to production if we need to. One of the quirks of the AP service is that they run tests and live data on the same servers, often waiting until the morning of an Election Day to start “moving zeroes.” We often want to set up race result pages before the AP is ready, but we most definitely do not want to run the risk of posting test data anywhere public. We do this by loading the data in staging and updating a few things like vote counts and precincts reporting to be zero (remember, SQL is really good at mass assignments).
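
A zeroing pass of this kind might look like the sketch below, which wipes vote counts, precincts reporting and any winner flags for one state's races while they are still in staging. The state_id column and the table names are assumptions made for the example, and the SQL is run through Python's sqlite3 only for convenience.

```python
import sqlite3

# Hypothetical staging schema for illustration.
SCHEMA = """
CREATE TABLE staging_races   (race_id TEXT PRIMARY KEY, state_id TEXT,
                              precincts_reporting INTEGER);
CREATE TABLE staging_results (race_id TEXT, candidate_id TEXT,
                              votes INTEGER, winner TEXT);
"""

def zero_out_staging(conn, state_id):
    """Blank out test numbers for one state before anything reaches production."""
    conn.execute(
        "UPDATE staging_races SET precincts_reporting = 0 WHERE state_id = ?",
        (state_id,))
    conn.execute("""
        UPDATE staging_results
        SET votes = 0, winner = NULL
        WHERE race_id IN (SELECT race_id FROM staging_races WHERE state_id = ?)
    """, (state_id,))
    conn.commit()
```

Because the zeroing happens in staging, the production tables only ever see the zeroed rows.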

Having a table that records what races have changed on each load turns out to be very useful indeed. We use it in our internal race-calling interface; AJAX requests check for changed races and update the results without reloading the page. We use it to optimize our publisher to bake out new static versions of only the races that have changed. We used it to record a detailed log of the race changes for key races, so we could chart the changes in vote margins during the night. We used it during the primaries to send emails whenever delegate allocations changed.
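
Most of these uses boil down to one question: which races have changed since the last load I saw? A sketch of that lookup, assuming race_changes has just the load_id and race_id columns mentioned earlier (the function name is made up for the example):

```python
import sqlite3

SCHEMA = "CREATE TABLE race_changes (load_id INTEGER, race_id TEXT);"

def changed_since(conn, last_seen_load_id):
    """Return the IDs of races that changed in any load after last_seen_load_id."""
    rows = conn.execute(
        "SELECT DISTINCT race_id FROM race_changes "
        "WHERE load_id > ? ORDER BY race_id",
        (last_seen_load_id,))
    return [row[0] for row in rows]
```

A polling client would remember the highest load ID it has seen and pass it back on the next request; a publisher could use the same list to decide which static pages to rebuild.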

The Importance of Naming Things

Fairy tales have it right: names are power. In order to find a race in the database, you have to know how to call it in the database. The AP does assign numeric IDs for each race, but these are generally not guaranteed in advance and may even be reused over the course of a year (which is why we append a YYYYMMDD timestamp to IDs in our database). For instance, to find the New York Presidential Republican Primary, you could look for the race with the following conditions: {state_id: 'NY', office_id: 'P', race_type_id: 'R'} (Republican primary). Change the office_id to 'H', the race_type_id to 'G' and add a seat_number: 2 and you find the general election for the NY-2 House district. Generally, elections are consistent enough that you can easily figure out how to find a race you are looking for. Except when they aren’t.

Special elections usually muddle things up. This year, there were two elections (one to fill out the remainder of this term, one for the next term) with primaries and a general election for Gabby Giffords’ AZ-8 House seat; both would show up on a search for {office_id: 'H', race_type_id: 'G'}. Scott Walker’s recall election in Wisconsin was coded as a general election by the AP but it shouldn’t show up in the governor race results on November 6th. Even regular elections have their complications. For instance, California switched to open primaries this year where everybody runs and the top two vote-getters advance to the general election regardless of their party (that’s race_type_id: 'X' of course). A week before their primaries, we learned about Ohio’s interesting presidential primary process: every voter votes for a delegate from their congressional district and a statewide at-large delegate. The AP reports this as 17 distinct races for the Republican presidential primary in Ohio. We only want one. These are just a few of many examples. Every state has its own edge cases. This once meant that logic for handling those edge cases wound up duplicated in all the front-end applications that used the AP election results. What a mess.

This year, I decided to try a different approach and added a layer of abstraction: a mechanism for mapping our own NYT race slugs onto AP races. Thus, we map “ny-house-district-2-2012-general” to the AP fields {race_type_id: 'G', office_id: 'H', seat_number: '2', state_id: 'NY'}. If these conditions match a single race, then we have a successful mapping and can store the AP race ID in the table, binding our slug to that race. Unlike an AP identifier, it’s easy to derive the NYT slug for a race. For cases where the mapping fails because it matches too many races, we can add additional fields to constrain to a single race or manually fix the AP race ID in the database if worse comes to worst.

That is what happened with Ohio. Before the first caucus in Iowa, I had autopopulated NYT race mappings for all the presidential primaries (Derek Willis maintains the NYT politicians/races API and was my steadfast slugmaster for the entire election year). When we started testing for the Ohio primaries, I noticed it was mapping to 17 races. Instead of having to alert all the other developers to patch their code, I just added an additional constraint, mapping “oh-president-2012-primary-rep” to {state_id: 'OH', office_id: 'P', race_type_id: 'R', seat_name: 'Delegate-at-Large'}, thus mapping the NYT concept of the Ohio Republican primary to the statewide delegate-at-large race in the AP. Brief panic averted, I could sip my victory coffee.
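
The matching rule is simple enough to sketch in plain Python: a mapping succeeds only when its conditions select exactly one AP race. The function name, the ap_race_id key and the sample rows below are invented for illustration (three Ohio rows stand in for the real 17); only the field names in the conditions come from the examples above.

```python
def resolve_mapping(ap_races, conditions):
    """Return the ID of the single AP race matching every condition.

    Raises LookupError when zero or several races match, which is the cue
    to add another constraint (or fix the ID by hand).
    """
    matches = [race for race in ap_races
               if all(race.get(field) == value
                      for field, value in conditions.items())]
    if len(matches) != 1:
        raise LookupError(f"{len(matches)} AP races match {conditions}")
    return matches[0]["ap_race_id"]


# Made-up sample rows standing in for AP race records.
AP_RACES = [
    {"ap_race_id": "1001-20120306", "state_id": "OH", "office_id": "P",
     "race_type_id": "R", "seat_name": "Delegate-at-Large"},
    {"ap_race_id": "1002-20120306", "state_id": "OH", "office_id": "P",
     "race_type_id": "R", "seat_name": "District 1"},
    {"ap_race_id": "1003-20120306", "state_id": "OH", "office_id": "P",
     "race_type_id": "R", "seat_name": "District 2"},
]

OHIO_BASE = {"state_id": "OH", "office_id": "P", "race_type_id": "R"}
OHIO_FIXED = dict(OHIO_BASE, seat_name="Delegate-at-Large")
```

With this data, `resolve_mapping(AP_RACES, OHIO_BASE)` raises because three races match, while `resolve_mapping(AP_RACES, OHIO_FIXED)` returns the at-large race's ID.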

This approach is a specific example of a general strategy when working with third-party APIs: place an abstraction layer between your code and theirs. Downstream users of the AP election data should not use any of the AP’s codes or identifiers, letting our intermediary mapping layer do the translation. This approach also provides a nice place to anchor other exaptations where we need to enhance or override AP data. For instance, we have some copy-editing differences on candidate names and ballot initiative titles. We also need to track the incumbent parties of major races so we can calculate gains for each party. The AP does not provide this, but it was trivial to add to the nyt_races table. This even solved a general problem that bothered us on prior elections: how do we mark races that we are interested in showing on the site? The AP election data contains everything from presidential races to town aldermen; we want to only present a subset on the site. Just having a NYT race mapping is a mark that it’s a race we care about.

Election Results As a Service

Election results are just one part of the complete New York Times election site, but they require a large amount of logic and models to support. In the past, we had tried some awkward approaches where two applications would share the same database and we would copy over models for working with the AP results to the election application. That approach created a lot of organizational headaches, however. For this election, we decided to take the bolder step of keeping the election_results application completely separated from the application powering the election site, providing data only through a JSON API. In other words, we built Election-Results-as-a-Service (ERaaS).

This approach worked much better. But it requires some additional effort and good communication between the creator and consumers of the API to work. You want to avoid situations where developers who will be using the API are forced to wait for you to implement the API before they can start working. Before we worked on major API endpoints we would often manually build the JSON we expected it to produce for a single call. This could then be loaded on the client side by a simple stub, letting the API users wire up their endpoints against real data while the service was built. In addition, we set up the election_results application on a staging server. When developers worked on the election site on their laptops, the development code would default to making API calls against the staging server (API calls could be made against localhost or even production API servers by setting environment variables locally). This allowed new developers to get working on the election site without also having to set up election_results locally on their machines.
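
A client-side setup along these lines could look like the sketch below. Everything named in it is hypothetical: the ELECTION_RESULTS_API variable, the staging URL, the /races/<slug>.json path and the stub directory layout are placeholders chosen for the example, not the real configuration.

```python
import json
import os
from urllib.request import urlopen

# Hypothetical default: development code talks to the staging server.
DEFAULT_API_BASE = "http://staging.example.invalid/election_results"

def api_base(environ=os.environ):
    """Return the API base URL, overridable through an environment variable."""
    return environ.get("ELECTION_RESULTS_API", DEFAULT_API_BASE).rstrip("/")

def fetch_race(slug, stub_dir=None, environ=os.environ):
    """Fetch one race as parsed JSON.

    If stub_dir is given, read a hand-built <slug>.json file from it instead
    of calling the API, so clients can be wired up before the endpoint exists.
    """
    if stub_dir is not None:
        with open(os.path.join(stub_dir, slug + ".json")) as stub:
            return json.load(stub)
    with urlopen(f"{api_base(environ)}/races/{slug}.json") as response:
        return json.load(response)
```

Pointing a laptop at a local server is then a matter of setting the variable, for example to http://localhost:3000, before starting the site.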

Election results are highly nested entities. Each state race usually has multiple results for each candidate and some other associated data. When you want to render something complex like a results page for a state, you have a choice: do you make many small queries or one large query that encompasses the data you need? The REST approach for APIs jibes well with the resource approach in Rails and argues for many smaller API requests, both for architectural coherence and execution speed. But this puts a lot of work on the clients to figure out what they need and fetch it efficiently in parallel, so we decided to go with large API responses that encompassed them all. So, when you look at a page like the Senate Big Board, it’s built from two API requests: one to render the tally at the top and one that provides details of every race. In many cases, these API requests could take a long time to execute. Downstream web caches like Varnish can smooth traffic a bit, but you still need cache misses to execute efficiently.

I spent some time fretting about JSON generation performance, and then I cheated. In most requests like “give me all major races for California,” the API response is essentially an array of JSON representations of individual races (although it may have some metadata up top). I can already use the nyt_races table to get the list of races to render and I can use that to pull the race data from the AP tables if I need to render the JSON. Instead, I use the NYT Race ID as part of a key for storing a cached representation of the race JSON. And since I know when a race changes, I can keep the cached JSON indefinitely and regenerate it as part of the loading cycle. Through further exaptations I was able to cut down a 10 second API request to 10 milliseconds. Cheating never felt so good.
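
The caching trick can be sketched as follows. This is a simplified illustration with an in-memory dict standing in for whatever cache store was actually used; the class name, the key format and the render_race callback are all assumptions.

```python
import json

class RaceJsonCache:
    """Keep one pre-rendered JSON string per race, keyed by NYT race ID."""

    def __init__(self, render_race):
        # render_race(race_id) -> dict; the slow path that reads the AP tables.
        self._render = render_race
        self._store = {}

    @staticmethod
    def _key(race_id):
        return f"race_json:{race_id}"

    def refresh(self, changed_race_ids):
        """Called from the load cycle: re-render only the races that changed."""
        for race_id in changed_race_ids:
            self._store[self._key(race_id)] = json.dumps(self._render(race_id))

    def _get(self, race_id):
        key = self._key(race_id)
        if key not in self._store:  # cold cache: render once and keep it
            self._store[key] = json.dumps(self._render(race_id))
        return self._store[key]

    def response(self, race_ids, meta=None):
        """Build a big API response by splicing together cached JSON strings."""
        bodies = ", ".join(self._get(race_id) for race_id in race_ids)
        return '{"meta": %s, "races": [%s]}' % (json.dumps(meta or {}), bodies)
```

Serving a request then involves no rendering at all for races that are already cached; the expensive work moves into the loader, which already knows exactly which races need it.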

Admins for Each Audience

Admin snapshot

So far, I’ve discussed the election_results loader and how we share results data with the election site. There’s one other component of the election_results application that bears mentioning: the internal admin screens used by the newsroom. The most important of these was the calling interface. Although the AP makes its own race calls, we prefer to manually call major races (although we are happy enough to autocall minor races following the AP). While our operation will never approach the scale of the networks’ heavily-staffed calling desks, we were able to more effectively call all the primaries and the general election through a view of the election data used only by the two editors who made all of the calls. Similarly, on election night, another team used a custom admin tool to record the race calls made by the networks (so we could show them as part of our coverage). Another admin existed for editing race mappings and copy-editing names.

And of course, there were other applications with their own admins. At various points during the election cycle, I tweeted screenshots of our admin screens. This is not because they were always beautiful (though I am proud of the calling interface), but because there is something right about showing the workings of the mechanism, even when it’s not working perfectly. It’s a bit like revealing how the magic trick worked or giving a tour of the tunnels under Disneyland. Which is why I’m excited about Source and honored to have contributed a piece to it.
