How We Built a Lifetime Eclipse Predictor

5,000 years of eclipse data, distilled to a customized human lifespan

The total solar eclipse in August 2017 is a rare event; it’s been a century since something similar happened in America. We wanted to bring the awe of the event to readers and get them excited about it.

The idea for our lifetime eclipse-finder project grew out of a widely used NASA database of eclipse predictions. The data is dense (5,000 years’ worth), and I was surprised that nobody in the media dataviz community had really taken advantage of the dataset, at least in recent years.

We knew this was going to be a data-rich analysis, heavy with numbers and statistics. So how could we take an already-heady scientific topic and get people excited about it? In this case, our narrative uses the statistical analysis to spark a light-bulb moment for readers.

By building a story around the rarity of the event and customizing the data to the reader, we let the data convey the gravity of the event, while the design and interactivity keep the piece fun and delightful to play with.

Here’s a rough outline of our process for the project.

Scrape the Data

Fred Espenak is the scientist behind NASA’s eclipse predictions. The site is great, but it’s hard to navigate, and there is no way to batch-download the data. French developer and eclipse specialist Xavier Jubier uses the NASA data to build interactive Google Maps for each eclipse on his website.

Jubier also provides downloads of individual eclipses as KMZs for use in Google Earth. Again, no batch download is available, so the best I could do was write a scraper in Python using Selenium to spin up a browser instance, load each eclipse’s download URL, and grab the file.

Get the Data in the Right Format

There’s not much you can do with KMZs directly: the data is compressed, and you can really only open them with Google Earth. But KMZs are just compressed KML (Keyhole Markup Language) files. (Did you know that? I didn’t.) In most cases, you can rename the .kmz extension to .zip, unzip it, and you’ll have KMLs.
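Since a KMZ is just a ZIP archive, Python’s standard-library zipfile can pull the KML out directly, with no renaming needed. A minimal sketch (not the exact script I used):

```python
import zipfile

def kmz_to_kml(kmz_path, out_path):
    """Extract the first .kml file found inside a KMZ archive."""
    with zipfile.ZipFile(kmz_path) as kmz:
        # A KMZ is a ZIP; the main document is conventionally named doc.kml
        kml_name = next(n for n in kmz.namelist() if n.lower().endswith(".kml"))
        with open(out_path, "wb") as out:
            out.write(kmz.read(kml_name))
```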

After I did that, I wanted to throw them into QGIS just to see what they looked like. However, with QGIS’s terrible UI, I had to select layers in a dialog box every time I imported a KML. To sidestep this for 3,000+ KMLs, I wrote a script to combine everything into one KML. Thus began the parse script that I built on for the rest of the project.

Parse the Data: Combine KMLs

For this project, the KMLs were just XML files with some special geospatial tags. I treated them like XML files and used BeautifulSoup, a Python HTML and XML parser, to parse the files.

Each KML file corresponded to one eclipse. I pulled the relevant elements out of each file and combined them into one KML, adding a KML ExtendedData tag to store each eclipse’s metadata. Then I could look at all 3,000+ eclipses at once in QGIS and filter them by year.
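The merge step can be sketched with the standard library’s ElementTree (the actual project used BeautifulSoup, and the metadata fields here are illustrative):

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
ET.register_namespace("", KML_NS)  # serialize KML with its default namespace

def combine_kmls(kml_strings, metadata):
    """Merge the Placemarks of many KML documents into one Document,
    tagging each with ExtendedData so its metadata survives the merge."""
    root = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(root, f"{{{KML_NS}}}Document")
    for kml_text, meta in zip(kml_strings, metadata):
        tree = ET.fromstring(kml_text)
        for pm in tree.iter(f"{{{KML_NS}}}Placemark"):
            ext = ET.SubElement(pm, f"{{{KML_NS}}}ExtendedData")
            for key, value in meta.items():
                data = ET.SubElement(ext, f"{{{KML_NS}}}Data", name=key)
                ET.SubElement(data, f"{{{KML_NS}}}value").text = value
            doc.append(pm)
    return ET.tostring(root, encoding="unicode")
```

With one metadata dict per input file (say, the eclipse date), every path in the combined KML stays filterable in QGIS.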

A GIF of what all the eclipse paths looked like when they loaded in QGIS, colored by year ranges.

Rethink Parsing: Calculate Polygons from Linestrings

The most draining part of the data parsing was calculating polygon shapes from the KMLs. At first, it seemed very simple: the Northern and Southern Limits each had a coordinates element containing a linestring of longitude and latitude pairs. I figured I could just string these together by reading the Northern Limit as-is and concatenating it with the Southern Limit read backwards.

This worked for most eclipses, but I had neglected the fact that longitudes flip sign at the international date line (jumping between +180 and -180).
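A sketch of that naive stitching, with coordinates as (lon, lat) pairs:

```python
def path_polygon(northern_limit, southern_limit):
    """Build a closed polygon ring from the two limit linestrings:
    walk the northern limit forward, then the southern limit backward,
    and close the ring. Naive: it assumes the path never crosses the
    international date line."""
    ring = list(northern_limit) + list(reversed(southern_limit))
    ring.append(ring[0])  # close the ring back to the start point
    return ring
```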

Roadblock: The Date Line

A big part of the parsing process was dealing with paths that crossed the international date line.

After asking around, I found a possible solution to my problem in TheSpatialCommunity Slack: apparently ogr2ogr, a geospatial command-line tool, has a -wrapdateline option that cuts polygons crossing the date line in two so they don’t render badly.

This worked for some of the problematic polygons, but not all. Some polygons near the poles, as well as others that crossed the date line multiple times, were still not right.

I decided to write my own script to fix all the polygons. With the help of my colleague, Armand Emamdjomeh, I wrote a preliminary script that watched for sign changes in the longitudes and started a new polygon wherever a path crossed the international date line. After more tweaks to handle edge cases, this became, more or less, the basis of our parse script for breaking up the polygons.
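A simplified version of that detection, not our production script: instead of a bare sign-change test (which would also fire at the prime meridian, one of the edge cases), this looks for longitude jumps wider than 180 degrees, which only happen at the date line:

```python
def split_at_dateline(points):
    """Split a list of (lon, lat) points into pieces wherever consecutive
    longitudes jump across the international date line. A jump wider than
    180 degrees can only come from the +/-180 crossing, so prime-meridian
    sign changes are left alone."""
    pieces, current = [], [points[0]]
    for prev, point in zip(points, points[1:]):
        if abs(point[0] - prev[0]) > 180:  # crossed the date line
            pieces.append(current)
            current = []
        current.append(point)
    pieces.append(current)
    return pieces
```

A real version also has to close each piece back into a valid polygon along the date line edge; this sketch only breaks up the point list.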

In the end, with a deadline in sight, I had to write a few patches for the remaining problematic eclipse paths, but the final script rendered the maps correctly at presentation scale.

Finding Story Angles with Interesting Data: Turf.js and Clipping in QGIS

One obvious angle was to pull superlatives out of the data: which city saw the most and fewest eclipses? Again, I turned to TheSpatialCommunity Slack, where several people suggested Turf.js, a JavaScript library with built-in geospatial functions.

To cut down on processing time and data chugging, my colleague Tim Meko helped me clip the eclipse shapefile to U.S. boundaries with ogr2ogr’s -clipsrc option, so I only had to deal with a subset of the data.

I created a grid of coordinates to cut the contiguous U.S. into uniform rectangles so I could measure portions equally. Then, using Turf.js’s intersect function, I wrote a node script that went through each coordinate, created a square starting from it, and checked whether that square intersected any of the eclipse paths in my data.
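The grid-counting idea can be sketched in Python (the project itself used Turf.js’s true polygon intersection in node; this stand-in only tests bounding-box overlap, which is much cruder):

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes (min_lon, min_lat, max_lon, max_lat)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def grid_eclipse_counts(bounds, step, eclipse_bboxes):
    """For each step-degree square inside bounds, count how many eclipse
    bounding boxes overlap it. Returns {(lon, lat): count}, keyed by each
    square's lower-left corner; high counts flag the most-eclipsed areas."""
    min_lon, min_lat, max_lon, max_lat = bounds
    counts = {}
    lon = min_lon
    while lon < max_lon:
        lat = min_lat
        while lat < max_lat:
            square = (lon, lat, lon + step, lat + step)
            counts[(lon, lat)] = sum(
                1 for box in eclipse_bboxes if boxes_overlap(square, box)
            )
            lat += step
        lon += step
    return counts
```

Swapping the bounding-box test for a real polygon intersection (as Turf.js does) is what makes the counts trustworthy near the edges of each path.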

A draft analysis using a grid system to find the most and least eclipsed areas of the contiguous U.S. (Roughly: red dotted areas have seen relatively fewer eclipses than green dotted areas.)

Narrative and Optimization: SHP → GeoJSON and D3

The spinning globe at the top of the page materialized fairly late in the project timeline. All the graphics were interesting to look at, but we wanted something to personalize the graphic for people and draw them in. Thus the page begins by narrowing down a century’s worth of eclipse data based on the reader’s birth year.

I think this not only encourages shareability but also helps the reader understand the rarity of the event: people might respond to “in my lifetime” more than “in the next 100 years.”

I converted my eclipse SHP to GeoJSON so I could manipulate the data interactively with D3, a JavaScript library with geospatial capabilities. I also used mapshaper to simplify and optimize the SHP for browser rendering.

My editors suggested a globe (instead of an earlier iteration using a flat map on a Robinson projection). Letting the reader drag the globe was also interesting: we could provide worldwide data for a bigger audience while keeping the narrative focused on how rare the Aug. 21 eclipse was in the contiguous U.S.

Design and Presentation

The design for these eclipse maps was unlike anything I had really done before. I used the coastline of the contiguous U.S. as a framing device to emphasize the rarity of the event within those boundaries.

But the data isn’t constricted by these boundaries, which meant that the maps were inherently going to be very busy. We wanted the focus to be on the beauty of the eclipse paths, so the labeling on the maps was stripped away bit by bit with each edit. (For example, there were originally state boundaries on the maps, which were completely dark purple, making the data hard to see.)

This minimalist approach helped make the data the focal point, while also presenting it in an artistic way.

We wanted a coherent look to the entire project, and I think we achieved this through a custom color palette that held everything together, using purples to convey the “shadow” of the eclipse paths and orange as a highlight color.

In Conclusion

Overall, the process was lengthy, but the outcome was worth it. Doing this much processing meant I understood the data really well, which let me come up with more ideas for graphical presentations that would interest readers.

I think the most important part of this project—similar to most of our projects—was the iteration process. Many charts and maps were left out of the end product, but the ideation process for those helped churn up new ideas that eventually became the final maps.

In this situation, curation really helped: a clear narrative guides the reader through the graphics, as opposed to a bunch of maps thrown on a page. I was sad not to show the entire 5,000 years’ worth of data in the final product, but in reality that hefty amount isn’t really humanly relatable, and humanizing this large dataset for an already complex topic was key.



