Learning:

London Calling: Winning the Data Olympics

Jacqui Maher on wrangling massively complex, really messy data in (almost) realtime


Data emergency preparedness (Interactive News/New York Times)

The Interactive News team at The New York Times took on the challenge of covering the Summer Games in London last year.

Our efforts began with three of us meeting in the newsroom to scope out the project, and they kept going right up to the completely amazing and bonkers closing ceremony in London, a year and a half later. We had to figure out how to start with the XML provided by the International Olympic Committee (IOC) and end up with a website that helped our readers learn and contextualize the important results and news distributed among thousands and thousands of data points.

One of our top concerns was ensuring the validity of results, facts, and figures in a huge set of diverse sporting events at a hectic volume, in as close to realtime as possible. We aimed to balance such granularity with the overall story-telling angle: what do our readers expect to see in our coverage of the event? How do we best present the complex results for each athlete in a timely and accurate way without overloading our website with TMI? You should check out my colleague Tiff Fehr’s recent Learning case study for details about how we solved the front-end challenges to managing all the various visual treatments of results.

I’ll be describing the way the IOC sends data—from athlete biographies to world records broken—to news and media partners before and during the games, how we came to understand and prepare it for Tiff and teammates to render those result table templates. In doing all that, we had to make decisions about what kinds of data to give priority to, based on significance and the level of accuracy we were aiming for in each scenario. I’ll give you some examples to show you the complexity of the data so you can see what we were up against, but I’ll do my best to avoid falling into the many esoteric datapoints that this genre is full of and keep the technicalities brief. Now that I’m working on the Sochi Winter Games coverage, my third Olympics for the Times, I’ll be describing how our approach to covering the Olympics has changed over the years. The lessons our team has learned on these projects about how to manage the challenges and how to adapt over time should be translatable to any complex data problem-set, Olympic or not.

Running and Jumping and Sliding… in XML

Sports in general is big on stats, facts, and figures. Just about any competition that tests the mettle of athletes can be broken down into data points, like personal-best times crossing the finish line of a 5k race, or top career home runs in Major League Baseball. Bringing a sport’s national champions together in international competitions—for instance, soccer’s World Cup—adds more layers of information. And then there’s the Olympics. How much more data is that? Well, in 2 weeks of the Olympics over 204 gold, silver, and bronze medals were awarded after 7,000 competitions to the best of 32,000 athletes from around the world. It took us about thirty thousand code commits to the main git repository to figure out how to show it. :)

The scope of the challenge is enormous. Combining dozens of unrelated sports in a massive international event like the Olympics requires some deft data modeling and wrangling, because it involves an elaborate assemblage of gameplay and advancement rules plus a wide variety of results and record types for 36 sports. The IOC first took on the challenge of representing such disparate sports as the 100m Dash and pairs Figure Skating in a parsable data format for the 2010 Winter Games in Vancouver. The result was an XML-based format called the Olympic Data Feed, or ODF. The feed is made up of several types of messages that describe different categories of information. A “participant” message lists athletes: first and last name, country, date of birth, sports played, etc. The start and end times of each unit of competition are gathered in a “schedule” message for each sport. Once the games are underway, results at every stage of competition are sent in a few different types of messages: unit, phase and cumulative results, pool standings, brackets, and medallists. Tying or breaking World and Olympic records triggers “record” messages in the feed, and these describe the athlete who set the new record and info on the previous one.

Accepting XML

The very first problem we had to solve was how to manage the incoming messages from the data feed. During the games, the XML is sent in the body of an HTTP POST request by the computer scoring systems at each venue. With so many events occurring at the same time in a relatively tight timeframe, our data-ingestion application had to keep up with high transmission rates and quickly determine which messages were the most relevant. It also had to preserve all the data it received, for validation and error-handling purposes. Keeping a copy of the ODF was in our interests for future Olympic coverage as well—we weren’t provided with an end-to-end simulation of what the Olympics looked like in XML, but we would certainly know this by the time the last gold medal was awarded.

For Vancouver, we had built a slim web app called the Listener to be the endpoint for the data feed. It’s written in Ruby using the simple and fast Rack framework, skipping the unnecessary overhead of a bells-and-whistles framework like Ruby on Rails or even Sinatra. We took the time to evaluate how it had performed in 2010, and sure, we even discussed doing something different, but it worked so well that we decided to go the same route for London.

Given that some of these messages got up to 20 MB in size, parsing each one using an XML library at reception time would slow down the entire process, from the parsing and validating of results to displaying them to our readers. The most important thing was to get the data in the first place. Saving the data on our server filesystems with informative and identifying filenames allowed us to later prioritize and conditionally parse them. As it happened, the first line of every XML message from the IOC (the opening XML node) contained all the info required to generate a useful filename.

The workflow for the Listener was straightforward:

  1. The application waited for an HTTP POST request on port 80.
  2. A message was sent from the scoring systems at an Olympic venue to the Listener.
  3. The first line of XML was run through a regular expression that pulled out key datapoints, which were combined into a filename.
  4. The body was saved on the server using the generated filename.
  5. Finally, a job was added to the XML Parser queues, including a pointer to the message location.
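
The steps above can be sketched roughly as follows. This is Python rather than the Ruby and Rack code the team actually ran, and it leaves out the HTTP layer entirely. The regular expression, the filename layout, the `save_message` function, and the in-memory queue are all assumptions made for illustration, not the Listener's real implementation.

```python
import re
import tempfile
from collections import deque
from pathlib import Path

# Hypothetical pattern: pull a few identifying attributes out of the opening
# <OdfBody ...> tag without parsing the whole document.
ATTR = re.compile(r'(DocumentCode|DocumentType|Serial|Version)="([^"]*)"')

parse_queue = deque()  # stand-in for the real XML Parser queues


def filename_for(body: str) -> str:
    """Build an identifying filename from the <OdfBody> opening tag."""
    start = body.find("<OdfBody")
    if start == -1:
        raise ValueError("no <OdfBody> tag found")
    end = body.find(">", start)
    attrs = dict(ATTR.findall(body[start:end]))
    # Hypothetical layout: document type, document code, serial, version.
    return "{DocumentType}_{DocumentCode}_{Serial}_v{Version}.xml".format(**attrs)


def save_message(body: str, directory: Path) -> Path:
    """Save the raw body first, then enqueue a pointer to it for later parsing."""
    path = directory / filename_for(body)
    path.write_text(body, encoding="utf-8")
    parse_queue.append(str(path))
    return path


sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<OdfBody DocumentCode="AT0000000" Serial="250428" Time="170759394" '
    'Date="20120712" FeedFlag="P" LogicalDate="20120712" '
    'DocumentType="DT_PARTIC" Version="51">\n'
)
saved = save_message(sample, Path(tempfile.mkdtemp()))
print(saved.name)  # DT_PARTIC_AT0000000_250428_v51.xml
```

The point of the design is that nothing here touches an XML parser: the body is written to disk untouched, and only the opening tag is inspected.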

Should I Parse or Should I Go?

The above process was incredibly fast, quickly piling up the XML on the server. Now we had to do something with that data. As I said, we were able to give the files meaningful names informing the right course of action: is this a necessary message? Do we have to parse it or can we skip it? The answer to this question varied depending on the message type and backlog size of our queues.

We learned that we didn’t have to parse every single message to get a full, accurate view of the data. Incremental messages updating a single athlete’s time in a race were immediately followed by a fuller message listing every athlete’s results in that race. Some messages were always mandatory and had to be parsed, like notifications of Official Results. However, the incremental messages could be skipped if the queue got too long, allowing our systems to catch up with less work.

Each enqueued message from the Listener got routed through a dispatching library. This analyzed a combination of attributes, like the document type and event code, the competition status and sport rules, and the parsing queue health. This analysis was used to determine if the message should be parsed, and if so, which of the 24 specialized parsers to use. As you can imagine, a message describing a list of athletes in Volleyball has a different structure than one listing the schedule for Archery.
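
A minimal sketch of that dispatching decision, in Python for illustration. The parser names, the backlog threshold, the `incremental` flag, and the "AR" discipline code for Archery are assumptions; the article does not list the real rules or the 24 real parsers.

```python
from typing import Optional

MAX_BACKLOG = 500  # assumed queue-length threshold, not the real value

# Hypothetical routing table keyed on (document type, discipline code).
PARSERS = {
    ("DT_PARTIC", "AT"): "AthleticsParticipantParser",
    ("DT_RESULT", "AT"): "AthleticsResultParser",
    ("DT_SCHEDULE", "AR"): "ArcheryScheduleParser",
}


def dispatch(document_type: str, document_code: str, result_status: str,
             incremental: bool, backlog: int) -> Optional[str]:
    """Return the name of the parser to run, or None to skip the message."""
    mandatory = result_status == "OFFICIAL"
    if incremental and not mandatory and backlog > MAX_BACKLOG:
        # A fuller message will follow, so let the queue catch up.
        return None
    discipline = document_code[:2]
    return PARSERS.get((document_type, discipline))
```

With these assumptions, an official result is always routed to a parser, while an incremental update is dropped only when the backlog is over the threshold.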

What Is This XML for, Anyway?

Before we get into each athlete’s data, though, a quick note about how we answered a rather basic question: how do you know where to find this data among the myriad types of messages sent on the ODF? Even during the relative calm before the games, when we were sent the participant, schedule, and record messages I described above, we still had to classify each one. We were able to do this by analyzing the first tag used to open the document: the <OdfBody>. Every message we received on the Olympic Data Feed began with this tag. The contents varied, but in general, the attributes of an <OdfBody> described what the rest of the document would include.

Here’s an example:

<?xml version="1.0" encoding="utf-8"?>
<OdfBody DocumentCode="AT0000000" Serial="250428" Time="170759394" Date="20120712" FeedFlag="P" LogicalDate="20120712" DocumentType="DT_PARTIC" Version="51">
  <Competition Code="OG2012">
    <Participant Code="1083553" Parent="1083553" Status="ACCRED" GivenName="Dayron" FamilyName="Robles" PrintName="ROBLES Dayron" PrintInitialName="ROBLES D" TVName="Dayron ROBLES" TVInitialName="D. ROBLES" Gender="M" Organisation="CUB" BirthDate="19861119" PlaceofBirth="ISLA DE LA JUVENTUD" CountryofBirth="CUB" Nationality="CUB" MainFunctionId="AA01" Current="true" OlympicSolidarity="N">
      <Discipline Code="AT">
      </Discipline>
    </Participant>
    <Participant Code="1004617" Parent="1004617" Status="ACCRED" GivenName="Lyukman" FamilyName="Adams" PrintName="ADAMS Lyukman" PrintInitialName="ADAMS L" TVName="Lyukman ADAMS" TVInitialName="L. ADAMS" Gender="M" Organisation="RUS" BirthDate="19880924" Height="194" Weight="87" PlaceofBirth="LENINGRAD" CountryofBirth="RUS" Nationality="RUS" MainFunctionId="AA01" Current="true" OlympicSolidarity="N">
      <Discipline Code="AT" InternationalFederationId="208762">
        <RegisteredEvent Gender="M" Event="062">
          <EventEntry Code="E_PB" Type="E_ENTRY" Pos="1" Value="17.53"/>
          <EventEntry Code="E_QUAL_BEST" Type="E_ENTRY" Pos="1" Value="17.53"/>
          <EventEntry Code="E_SB" Type="E_ENTRY" Pos="1" Value="17.53"/>
          <EventEntry Code="E_SUBSTITUTE" Type="E_ENTRY" Value="N"/>
        </RegisteredEvent>
      </Discipline>
    </Participant>

I’ll get into more of the details of this message in the next section on Parsing. For now, let’s focus on the opening tag and see what it tells us:

<?xml version="1.0" encoding="utf-8"?>
<OdfBody DocumentCode="AT0000000" Serial="250428" Time="170759394" Date="20120712" FeedFlag="P" LogicalDate="20120712" DocumentType="DT_PARTIC" Version="51">

How does this communicate that the following XML lists athletes competing in Track & Field?

  • DocumentType="DT_PARTIC": ODF messages are divided into about 70 different types. ‘DT_PARTIC’ is a list of participants by discipline. A ‘discipline’ is what we usually refer to as a sport, but is more specific: cycling is a sport, track cycling is a discipline.
  • DocumentCode="AT0000000": the document code is a composite ID whose parts can describe the discipline (aka sport), event, phase, and unit of competition. Here the first two letters are ‘AT’, which is the code for Athletics. The more common name for that in America is Track & Field. The rest of the characters are zeros, which indicate no further categorization.
  • Date="20120712", Time="170759394": the message was sent on July 12, 2012 at 17:07:59.394, or 5:07 PM, in London.
  • LogicalDate="20120712": a term that still amuses me (what’s an illogical date?), this is the official day of the Olympics and typically the same as the date above, with exceptions for cases where a competition went past midnight.
  • FeedFlag="P": this flag indicates the message was sent on the production feed, as opposed to a test message that should be ignored (FeedFlag="T").
  • Version="51", Serial="250428": we can verify this is the latest version of a document, and that it wasn’t sent out of order, by keeping track of the version and serial values.
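
To make the attribute list above concrete, here is a small Python sketch that reads the same opening tag into typed values. The function names and the output keys are made up for this example. Comparing the (version, serial) pair to decide whether a message is the latest one is also an assumption about how those two values could be tracked, not a description of the team's actual check.

```python
import re
from datetime import datetime

ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')

OPENING_TAG = (
    '<OdfBody DocumentCode="AT0000000" Serial="250428" Time="170759394" '
    'Date="20120712" FeedFlag="P" LogicalDate="20120712" '
    'DocumentType="DT_PARTIC" Version="51">'
)


def parse_header(opening_tag: str) -> dict:
    """Turn the attributes of an <OdfBody> opening tag into typed values."""
    attrs = dict(ATTRIBUTE.findall(opening_tag))
    return {
        "document_type": attrs["DocumentType"],
        "document_code": attrs["DocumentCode"],
        # Date is YYYYMMDD; Time is HHMMSS followed by milliseconds.
        "sent_at": datetime.strptime(attrs["Date"] + attrs["Time"], "%Y%m%d%H%M%S%f"),
        "production": attrs["FeedFlag"] == "P",
        "version": int(attrs["Version"]),
        "serial": int(attrs["Serial"]),
    }


def is_latest(header: dict, seen: dict) -> bool:
    """Accept a message only if it is newer than the last one seen for its document."""
    key = (header["document_type"], header["document_code"])
    stamp = (header["version"], header["serial"])
    if stamp <= seen.get(key, (0, 0)):
        return False  # stale or out-of-order message
    seen[key] = stamp
    return True


header = parse_header(OPENING_TAG)
seen_versions = {}
print(header["sent_at"].isoformat())
```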

Parsing

Our parsers for the London Olympics were very closely tied to the structure of each message type’s XML. Each message was opened, read into memory and converted from XML into Ruby objects we could analyze and iterate over. I will walk you through a few examples and point out the pros and cons of our approach.

In the weeks leading up to the start of the games, the IOC sent us a series of “bulk” messages containing all of the athletes, teams, officials, and, yes, horses participating in the games, along with the full schedule for every single competition across the two weeks. The bulk data also included “historic record” messages detailing a dizzying array of stats: the time, place, sport, and athlete for the current World, Olympic, National, African, Oceania, European, Americas, and Asian records, along with each sport’s own set of records and personal best performances in various international competitions.

Displaying information on athletes competing in any of the Track & Field events—everything from a simple who’s-who list to the medal winners and race results—was dependent on making sense of that XML we looked at in the previous section.

I should point out that, for the sake of brevity (and your sanity), I only included the first two athletes’ data there. The full message contained 10,238 of those <Participant> nodes. The parser had to load this file, then iterate over each of those nodes, mapping all the fields to those in our database table of athletes. Some of the <Participant> nodes contained additional information that we wanted to store, like which sports or even specific events the athlete would be competing in.
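
As a rough illustration of that loop, the Python sketch below walks the <Participant> nodes of a trimmed-down copy of the message and maps attributes to row dictionaries. The column names in `FIELD_MAP` are hypothetical, since the article does not show the real athletes table, and the sample XML is shortened and closed off so that it parses.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<OdfBody DocumentType="DT_PARTIC" DocumentCode="AT0000000">
  <Competition Code="OG2012">
    <Participant Code="1083553" GivenName="Dayron" FamilyName="Robles" Organisation="CUB" BirthDate="19861119" Gender="M">
      <Discipline Code="AT"/>
    </Participant>
    <Participant Code="1004617" GivenName="Lyukman" FamilyName="Adams" Organisation="RUS" BirthDate="19880924" Gender="M" Height="194" Weight="87">
      <Discipline Code="AT">
        <RegisteredEvent Gender="M" Event="062"/>
      </Discipline>
    </Participant>
  </Competition>
</OdfBody>"""

# Hypothetical mapping from ODF attribute names to database column names.
FIELD_MAP = {
    "Code": "participant_code",
    "GivenName": "given_name",
    "FamilyName": "family_name",
    "Organisation": "country_code",
    "BirthDate": "birth_date",
    "Gender": "gender",
    "Height": "height_cm",
}


def athlete_rows(xml_text: str):
    """Yield one row dictionary per <Participant> node."""
    root = ET.fromstring(xml_text)
    for node in root.iter("Participant"):
        # Attributes missing from a node (such as Height for Robles) become None.
        row = {column: node.get(attr) for attr, column in FIELD_MAP.items()}
        # Discipline + gender + event gives the event code, e.g. ATM062.
        row["events"] = [
            discipline.get("Code") + event.get("Gender") + event.get("Event")
            for discipline in node.findall("Discipline")
            for event in discipline.findall("RegisteredEvent")
        ]
        yield row


rows = list(athlete_rows(SAMPLE))
```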

Finding the Athletes

      <Participant Code="1083553" Parent="1083553" Status="ACCRED" GivenName="Dayron" FamilyName="Robles" PrintName="ROBLES Dayron" PrintInitialName="ROBLES D" TVName="Dayron ROBLES" TVInitialName="D. ROBLES" Gender="M" Organisation="CUB" BirthDate="19861119" PlaceofBirth="ISLA DE LA JUVENTUD" CountryofBirth="CUB" Nationality="CUB" MainFunctionId="AA01" Current="true" OlympicSolidarity="N">
      <Participant Code="1004617" Parent="1004617" Status="ACCRED" GivenName="Lyukman" FamilyName="Adams" PrintName="ADAMS Lyukman" PrintInitialName="ADAMS L" TVName="Lyukman ADAMS" TVInitialName="L. ADAMS" Gender="M" Organisation="RUS" BirthDate="19880924" Height="194" Weight="87" PlaceofBirth="LENINGRAD" CountryofBirth="RUS" Nationality="RUS" MainFunctionId="AA01" Current="true" OlympicSolidarity="N">

This XML introduces Dayron Robles and Lyukman Adams, two stars of Track & Field with quite a few ways of styling their names depending on use. The XML in the feed is generated from the scoring systems on location at Olympic events, so it ends up including things that are really only necessary for displaying on scoreboards at the park and on Olympic TV. I’ll skip over the obvious and leave questions like “What is OlympicSolidarity and why don’t these guys have it?” to the reader.

      <Discipline Code="AT">
      </Discipline>

Moving right along, you’ll notice that the only extra info we’re giving on Dayron Robles is in a <Discipline> tag. It tells us that he’ll be competing in ‘AT’, which we already know as Track & Field. Since we already understand from the <OdfBody> tag that this message is a list of athletes competing in that discipline, we might be tempted to skip this part and keep our parsers light.

      <Discipline Code="AT" InternationalFederationId="208762">
        <RegisteredEvent Gender="M" Event="062">
        </RegisteredEvent>
      </Discipline>

However, take a look at Lyukman Adams. He has a <RegisteredEvent> tag under <Discipline> that shows he’s registered to compete in the men’s (Gender="M") event with the code “062.” Remember the DocumentCode composite ID from the <OdfBody> tag? The discipline code ‘AT’ is the first two characters. The remaining characters are the gender (1), event (3), phase (1), and unit (2) codes. We know the gender and code of the event he’s registered for, so replacing some of the zeroes we end up with ‘ATM062’, the event code for the Men’s Triple Jump.
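
The composite ID described above splits cleanly by position. Here is a short Python sketch of that split; the `DocumentCode` class and its field names are invented for this example, and only the 2 + 1 + 3 + 1 + 2 character layout comes from the text.

```python
from typing import NamedTuple


class DocumentCode(NamedTuple):
    discipline: str  # 2 characters, e.g. "AT"
    gender: str      # 1 character, e.g. "M"
    event: str       # 3 characters, e.g. "062"
    phase: str       # 1 character
    unit: str        # 2 characters

    @property
    def event_code(self) -> str:
        """Discipline, gender, and event together identify the event."""
        return self.discipline + self.gender + self.event


def split_document_code(code: str) -> DocumentCode:
    """Split a nine-character ODF document code into its parts."""
    if len(code) != 9:
        raise ValueError(f"expected 9 characters, got {len(code)}: {code!r}")
    return DocumentCode(code[0:2], code[2], code[3:6], code[6], code[7:9])


print(split_document_code("ATM012101").event_code)  # ATM012
```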

      <EventEntry Code="E_PB" Type="E_ENTRY" Pos="1" Value="17.53"/>
      <EventEntry Code="E_QUAL_BEST" Type="E_ENTRY" Pos="1" Value="17.53"/>
      <EventEntry Code="E_SB" Type="E_ENTRY" Pos="1" Value="17.53"/>
      <EventEntry Code="E_SUBSTITUTE" Type="E_ENTRY" Value="N"/>

The remaining data tells us that Adams’ all-time personal best (Code="E_PB"), qualifying best (Code="E_QUAL_BEST"), and season best (Code="E_SB") marks are all 17.53, and that he’s not entering the jump as a substitute (Code="E_SUBSTITUTE" Value="N") for someone else.

So, how did he do?

Results table showing Adams in ninth place


What about Mr. Robles of Cuba, though? How do we know what event he ended up competing in? That brings me to the second-most important messages: the results.

The Results

Fully explaining the various ways results are described and delivered in the Olympic data feed would take incredibly long (and try your patience). Since I’d like to get to the lessons learned, I’ll cut straight to Dayron Robles’ official result. Spoiler alert: he didn’t even finish the race! Here’s how to describe a disqualification in XML:

<OdfBody DocumentCode="ATM012101" DocumentType="DT_RESULT" FeedFlag="P" Date="20120808" Time="212121102" LogicalDate="20120808" Venue="STA" Version="1" ResultStatus="OFFICIAL" Serial="421">
    <Result ResultType="IRM" IRM="DQ" SortOrder="8">
      <Competitor Code="1083553" Type="A">
        <Composition>
          <Athlete Code="1083553" Order="1">
            <ExtendedResults>
              <ExtendedResult Code="AT_REACT_TIME" Type="UER_ATH_AT" Value="0.159"/>
              <ExtendedResult Code="AT_RULE" Type="UER_ATH_AT" Value="R 168.7b"/>
              <ExtendedResult Code="AT_WIND_SPEED" Type="UER_ATH_AT" Value="-0.3"/>
            </ExtendedResults>
          </Athlete>
        </Composition>
      </Competitor>
    </Result>

Translation: Robles competed in the Men’s 110m Hurdles (‘ATM012’) on August 8th, 2012.

Results table showing Robles as disqualified


Unfortunately he was disqualified. Athletes who place or medal in competitions end up with result types like TIME or DISTANCE, depending on what they were doing. Robles’ result type of IRM is an Invalid Result Mark, specifically a Disqualification (ResultType="IRM" IRM="DQ") according to official rule “R 168.7b”: he deliberately knocked over a hurdle. Whoops.
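
Reading that <Result> node in code might look like the Python sketch below. The `summarize_result` function and the keys of the dictionary it returns are hypothetical; the XML is the Robles result from above, trimmed to the single <Result> node so that it parses on its own.

```python
import xml.etree.ElementTree as ET

RESULT_XML = """<Result ResultType="IRM" IRM="DQ" SortOrder="8">
  <Competitor Code="1083553" Type="A">
    <Composition>
      <Athlete Code="1083553" Order="1">
        <ExtendedResults>
          <ExtendedResult Code="AT_REACT_TIME" Type="UER_ATH_AT" Value="0.159"/>
          <ExtendedResult Code="AT_RULE" Type="UER_ATH_AT" Value="R 168.7b"/>
          <ExtendedResult Code="AT_WIND_SPEED" Type="UER_ATH_AT" Value="-0.3"/>
        </ExtendedResults>
      </Athlete>
    </Composition>
  </Competitor>
</Result>"""


def summarize_result(result: ET.Element) -> dict:
    """Reduce a <Result> node to the few values a results table needs."""
    extended = {e.get("Code"): e.get("Value") for e in result.iter("ExtendedResult")}
    summary = {
        "athlete": result.find("Competitor").get("Code"),
        "type": result.get("ResultType"),
    }
    if summary["type"] == "IRM":
        # Invalid Result Mark: report the mark and the rule cited, not a time.
        summary["mark"] = result.get("IRM")
        summary["rule"] = extended.get("AT_RULE")
    else:
        summary["value"] = result.get("Result")
        summary["rank"] = result.get("Rank")
    return summary


summary = summarize_result(ET.fromstring(RESULT_XML))
```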

Another Games, Another ODF

Now that we’re prepping for Sochi, we looked back at our London work. For the most part, we found that the parsers were able to handle the incoming deluge of XML during the games fairly well. The bulk messages sent before the games were definitely the slowest to ingest, but we had the luxury of time on our side so it wasn’t such a problem. The messages sent during the games tended to be smaller, with larger, slower-to-process ones more the exception than the rule.

The main problems we ran into happened when trying to take the data out and put it into structures suitable for display on the site. The example above of the London Men’s Triple Jump illustrates this pretty well:

Men's triple jump results table


We tried to reuse templates as much as possible, and even so we ended up with close to a hundred partials just for displaying those event results tables. (Note: That’s about half the number of different events in the Olympics!) Similar Track & Field events like the Long Jump shared the same template with the Triple Jump, and each event had a qualification round prior to the final. The number of attempts made per athlete differs depending on the phase and type of event, so before we could even show the results, we had to look up the correct number: here, six.

Now let’s look at the first row of results: the gold medal winner, Christian Taylor. Figuring out what to display in the ‘Rank’ field involved a lot of lookups and logic:

  • Has the competition started? We shouldn’t show ranks or medals before it even begins!
  • Is this an event awarding a medal?
  • Was a medal awarded already? If so, does it belong to the competitor, Christian Taylor?
  • Taylor won a medal, but was it Gold, Silver, or Bronze?
  • If he didn’t win, did he finish? Did he place?
  • What is his rank if he did place?
  • Was he disqualified? Why?
  • What should we display for the disqualification code?
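
The chain of questions above amounts to a series of guarded checks. The Python sketch below shows one way to order them; the `ResultRow` fields, the `rank_cell` function, and the strings it returns are assumptions for illustration and do not reflect the real templates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultRow:
    """Hypothetical, already-looked-up facts about one competitor in one event."""
    competition_started: bool
    medal_event: bool
    medal: Optional[str] = None  # "GOLD", "SILVER", "BRONZE", or None
    rank: Optional[int] = None
    irm: Optional[str] = None    # invalid result mark, e.g. "DQ"


def rank_cell(row: ResultRow) -> str:
    """Decide what to print in the 'Rank' column for one row."""
    if not row.competition_started:
        return ""                  # no ranks or medals before the event begins
    if row.irm:
        return row.irm             # disqualified or otherwise without a valid result
    if row.medal_event and row.medal:
        return row.medal.title()   # "Gold", "Silver", or "Bronze"
    if row.rank is not None:
        return str(row.rank)       # placed, but no medal
    return "-"                     # started, nothing to show yet


print(rank_cell(ResultRow(True, True, medal="GOLD", rank=1)))  # Gold
```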

Subsequent fields required looking up the athlete’s country code and mapping it to a flag, then picking the correct name format (keeping in mind this is sometimes teams, or even an athlete and a horse). To display each attempt’s value in the Triple Jump, we had to find each top-level result and then traverse several relational database associations, down to leaf-node extended data that we stored in Redis, then determine in Ruby whether to show a symbol or an actual value. This was because, while we did try to mirror the schema of the XML, we also modeled the data somewhat relationally and across multiple data stores.

To better illustrate what I mean, here’s how the XML for Taylor’s performance in the Triple Jump final looks:

<Result SortOrder="1" Rank="1" RankEqual="N" ResultType="DISTANCE" Result="17.81" QualificationMark="">
  <RecordIndicators>
    <RecordIndicator Order="1" Code="ATM062000" RecordType="SB"/>
  </RecordIndicators>
  <Competitor Type="A" Code="1131452">
    <Composition>
      <Athlete Code="1131452" Order="1">
        <ExtendedResults>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_LAST_COMPETITOR" Value="N"/>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_CURRENT_COMPETITOR" Value="N"/>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_ORDER_INITIAL" Value="9"/>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_ORDER_CURRENT" Value="4"/>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_ORDER_4_5" Value="4"/>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_ORDER_FINAL" Value="4"/>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_BEST_ATTEMPT" Value="4"/>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_WIND_SPEED" Value="+0.6"/>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_SPLIT" Pos="1" Value="1">
            <Extensions>
              <Extension Type="AT_SPLIT" Code="AT_RESULT" Value="x"/>
              <Extension Type="AT_SPLIT" Code="AT_WIND_SPEED" Value="-0.2"/>
              <Extension Type="AT_SPLIT" Code="AT_LAST_COMPETITOR_SPLIT" Value="N"/>
              <Extension Type="AT_SPLIT" Code="AT_RUNWAY_SPEED" Value="37.5"/>
              <Extension Type="AT_SPLIT" Code="AT_STEP" Pos="1" Value=""/>
              <Extension Type="AT_SPLIT" Code="AT_STEP" Pos="2" Value=""/>
              <Extension Type="AT_SPLIT" Code="AT_STEP" Pos="3" Value=""/>
            </Extensions>
          </ExtendedResult>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_SPLIT" Pos="2" Value="2">
            <Extensions>
              <Extension Type="AT_SPLIT" Code="AT_RESULT" Value="x"/>
              <Extension Type="AT_SPLIT" Code="AT_WIND_SPEED" Value="-0.1"/>
              <Extension Type="AT_SPLIT" Code="AT_LAST_COMPETITOR_SPLIT" Value="N"/>
              <Extension Type="AT_SPLIT" Code="AT_RUNWAY_SPEED" Value="38.3"/>
              <Extension Type="AT_SPLIT" Code="AT_STEP" Pos="1" Value=""/>
              <Extension Type="AT_SPLIT" Code="AT_STEP" Pos="2" Value=""/>
              <Extension Type="AT_SPLIT" Code="AT_STEP" Pos="3" Value=""/>
            </Extensions>
          </ExtendedResult>
          <ExtendedResult Type="UER_ATH_AT" Code="AT_SPLIT" Pos="3" Value="3">
            <Extensions>
              <Extension Type="AT_SPLIT" Code="AT_RESULT" Value="17.15"/>
              <Extension Type="AT_SPLIT" Code="AT_WIND_SPEED" Value="-0.1"/>
              <Extension Type="AT_SPLIT" Code="AT_LAST_COMPETITOR_SPLIT" Value="N"/>
              <Extension Type="AT_SPLIT" Code="AT_RUNWAY_SPEED" Value="37.8"/>
              <Extension Type="AT_SPLIT" Code="AT_STEP" Pos="1" Value="5.46"/>
              <Extension Type="AT_SPLIT" Code="AT_STEP" Pos="2" Value="5.09"/>
              <Extension Type="AT_SPLIT" Code="AT_STEP" Pos="3" Value="6.60"/>
            </Extensions>
          </ExtendedResult>
        </ExtendedResults>
      </Athlete>
    </Composition>
  </Competitor>
</Result>

I omitted his last 3 attempts, and yes, it’s still pretty extensive. I’ll be skipping over some of it; if you’re curious, check out the Athletics Data Dictionary and have fun :) The <Result> node includes Taylor’s rank of 1 and distance jumped as 17.81 meters. We stored top-level result data in a table in MySQL, with associations out to the athletes (or teams) keyed on participant code. Extended results, which here list the attempts made in the competition (<ExtendedResult Code="AT_SPLIT" Pos="1"> is the first attempt), were also stored in a table in MySQL that associated back up to the results. Things get complicated when an extended result itself has extended data: here, the extensions have the actual distance jumped per attempt. These were stored in Redis, as the volume was pretty high across all sports.

Going back to the rendered view of this data, displaying it required us to:

  • select results for the correct phase (final) of this event
  • sort correctly according to status (the start order before competition, the result order, like rank, afterwards)
  • look up names in associated tables
  • look up extended results for each result
  • look up extensions for each extended result
  • select the value of the extension coded AT_RESULT for each attempt
  • convert it from meters to feet, unless it’s an ‘x’ or a ‘-’, for our American audience
  • highlight the best attempt of the six
  • convert the final result, which should be the best attempt, to feet as well
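
Several of the steps above (finding the AT_RESULT extension for each attempt, converting to feet, and picking the best attempt) can be shown directly against the XML. This Python sketch works on a trimmed copy of Taylor's first three attempts. It reads straight from the XML rather than from MySQL and Redis as the London system did, and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

METERS_PER_FOOT = 0.3048
NON_MARKS = {"x", "-", ""}  # fouls, passes, and empty values are shown as-is

ATHLETE_XML = """<Athlete Code="1131452" Order="1">
  <ExtendedResults>
    <ExtendedResult Type="UER_ATH_AT" Code="AT_BEST_ATTEMPT" Value="4"/>
    <ExtendedResult Type="UER_ATH_AT" Code="AT_SPLIT" Pos="1" Value="1">
      <Extensions><Extension Type="AT_SPLIT" Code="AT_RESULT" Value="x"/></Extensions>
    </ExtendedResult>
    <ExtendedResult Type="UER_ATH_AT" Code="AT_SPLIT" Pos="2" Value="2">
      <Extensions><Extension Type="AT_SPLIT" Code="AT_RESULT" Value="x"/></Extensions>
    </ExtendedResult>
    <ExtendedResult Type="UER_ATH_AT" Code="AT_SPLIT" Pos="3" Value="3">
      <Extensions><Extension Type="AT_SPLIT" Code="AT_RESULT" Value="17.15"/></Extensions>
    </ExtendedResult>
  </ExtendedResults>
</Athlete>"""


def attempt_marks(athlete: ET.Element) -> List[str]:
    """Return the AT_RESULT value of each attempt, ordered by attempt number."""
    splits = [e for e in athlete.iter("ExtendedResult") if e.get("Code") == "AT_SPLIT"]
    marks = []
    for split in sorted(splits, key=lambda e: int(e.get("Pos"))):
        for extension in split.iter("Extension"):
            if extension.get("Code") == "AT_RESULT":
                marks.append(extension.get("Value"))
    return marks


def to_feet(mark: str) -> Optional[float]:
    """Convert a distance in meters to feet; symbols are not converted."""
    if mark in NON_MARKS:
        return None
    return float(mark) / METERS_PER_FOOT


def best_attempt(marks: List[str]) -> Optional[int]:
    """Index of the longest valid attempt, or None if every attempt is a symbol."""
    valid = [(float(m), i) for i, m in enumerate(marks) if m not in NON_MARKS]
    return max(valid)[1] if valid else None


marks = attempt_marks(ET.fromstring(ATHLETE_XML))
print(marks)  # ['x', 'x', '17.15']
```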

Imagine doing this in different ways for every single phase of every competitive event, in real time, for men and women, in English. That is, American English, British, Canadian and Australian English, Chinese, Portuguese, Brazilian Portuguese, Danish, Spanish, French, and so on for our syndication partners.

All those lookups, and the subsequent rendering, took a lot of system resources and time. Remember, we were trying to deliver results to our readers in as close to realtime as possible. That requires optimal performance in all stages of the process: receiving and parsing XML, validating and formatting the contents, and displaying it on the website. This was the part we needed to find a new way to handle.

Don’t Be Afraid of Change

When I signed on to work on the Winter Games taking place in Sochi, Russia, in February 2014, I was still feeling battle-scarred by the London experience. But, importantly, I was also still feeling inspired by the Olympics: the amazing feats of the athletes and the incredible challenge involved in reporting on them. I did some serious thinking before deciding to take on Sochi. The London team got together and discussed what we might change, or not, for coverage of the next games. This might sound obvious, but it’s so important to review projects while memories are still fresh and learn what you can from the experience.

And, with what we’ve learned from London, we’re changing things up a bit.

For starters, we’re not going to throw absurdly detailed and nested XML data into multiple relational database tables and datasets in Redis. We are definitely going to stick to the ODF script like we planned to for London. In fact, we’re staying much closer to it for Sochi. ODF messages sent to us for the Winter Games (and hopefully, for the Summer Games in Rio) are quickly parsed into JSON and indexed in our new Olympic data store: ElasticSearch.

Why?

We did our research and testing and found one name coming up again and again as a solution for easily indexable and queryable data, structured or not, in realtime: ElasticSearch. It’s built to be distributed across multiple servers, even dynamically while indexing data, and it speaks JSON, a format we prefer far more than XML. This allows us to build a sane API that is easy for developers to work with and that natively returns structures in a Web-friendly format. It’s much nicer to work with than the deluge, returning only what’s necessary to generate pages that make sense to readers.
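
To give a feel for the XML-to-JSON step, here is a minimal Python sketch that turns an ODF fragment into a nested JSON document while keeping the XML's own structure. The `element_to_dict` function and the shape of its output are assumptions; the article does not show the actual Sochi mapping, and the indexing call is left out because it is not described either.

```python
import json
import xml.etree.ElementTree as ET

RESULT_XML = """<Result SortOrder="1" Rank="1" RankEqual="N" ResultType="DISTANCE" Result="17.81">
  <Competitor Type="A" Code="1131452">
    <Composition>
      <Athlete Code="1131452" Order="1"/>
    </Composition>
  </Competitor>
</Result>"""


def element_to_dict(element: ET.Element) -> dict:
    """Attributes become plain keys; child nodes become lists keyed by tag name."""
    doc = dict(element.attrib)
    for child in element:
        # Avoid clobbering an attribute that shares its name with a child tag.
        key = child.tag if child.tag not in element.attrib else child.tag + "Nodes"
        doc.setdefault(key, []).append(element_to_dict(child))
    return doc


document = element_to_dict(ET.fromstring(RESULT_XML))
print(json.dumps(document, indent=2))
```

Because the JSON mirrors the feed, a page that needs Taylor's rank and participant code can read them from one document instead of joining several tables.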

Working on three consecutive Olympics for the Times has given me ample opportunity to consider how to approach a massively complex data project. I’ve walked you through the major issues produced by the sheer volume of data and the speed with which we needed to make it usable, how we thought through those issues to find solutions, what worked consistently, and what we’ve had to adapt along the way. The lessons I’ve learned are ones I’ll take with me on any project, Olympic or not. Be flexible and open to changing gears, but also be open to keeping battle-proven solutions around. Don’t dismiss older technology solutions—like saving to a filesystem—out of hand. Get out of your comfort zone, though, and don’t be afraid to try something new.

Maybe it’ll be the answer you’ve been looking for. Maybe not.

About the Author

Jacqui Maher is Assistant Editor, Interactive News, NYT. Loves the whimsical and the poignant. Dabbles in wordplay.
