And Remember, this Is for Posterity

Archives at the New York Times

Jacob Harris on the hows and whys of designing interactives to survive the future

The Web celebrates the ephemeral. It’s a hoary cliché that the Internet annihilates geography, but it also doesn’t care much for history. We laugh about the days when we used to have Friendster accounts and use flip phones, but that was only 10 years ago. All of that is gone now. That’s the Internet. We focus on the next big thing, launch, disrupt and then expire our once-beloved projects when they’re no longer worth maintaining. Thus, it’s hardly surprising that it’s easier for me to read an issue of the New York Times from 1851 than the election results from 2000. So what? Websites expire all the time, but we’re journalists. We like to think our work is for the ages.

I’m not the first or the only one to notice this. Indeed, Matt Waite has already written an excellent piece called “Kill Your Darlings” on the important need to think about how we end our projects before we begin them. If you haven’t read it, go ahead now. I’ll wait.

Okay, welcome back. Matt’s article highlights a key point: as developers, we are often only thinking to the next milestone and slightly beyond. Definitely not into the next year. Or 20 years from now. That’s also true of traditional narrative journalists. The good ones are often only writing for their next deadline. And yet, their work is perfectly designed for posterity. English changes but at a far slower rate than programming languages. Paper will crumble eventually, but any pile of Zip Disks lurking in old desk drawers testifies that print is more durable than many digital formats. While individual sections and design specifications may change, the newspaper format has been largely consistent for years. Thus, it’s for the most part possible to read a newspaper from 50 or 100 years ago. Some of the context may not make sense, but you can read it.

Death Is Not the End

So, what are the steps to take when it’s time to mothball an application? The key is to remember that nothing is more resilient for the future than a static page. If your application was built in a dynamic framework, the first step is to crawl it and save a static version of every page. In some cases, this may be as simple as running wget, but for sites that are not readily indexed, that might not be possible. An alternative approach is to figure out the routes in your application and iterate over all possible objects. In either case, don’t forget to also save JavaScript files, stylesheets and the like. You might be tempted to use a third-party CDN like Google’s Hosted Libraries or cdnjs, but that also makes your site vulnerable if that service ever shuts down. Some web applications may also load elements via JSONP callbacks, and it’s easy to forget to save those.
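As a rough illustration of the route-iteration approach, here is a minimal Python sketch. It assumes the application is still running somewhere reachable, and the helper names (`route_to_path`, `fetch_url`, `bake`) are hypothetical, not part of any framework. It saves only the routes you list, so scripts, stylesheets and JSONP responses have to be added to that list explicitly.

```python
from pathlib import Path
from typing import Callable, Iterable
from urllib.request import urlopen


def route_to_path(route: str, out_dir: Path) -> Path:
    """Map a route such as /states/ny/ to out_dir/states/ny/index.html."""
    clean = route.strip("/")
    if not clean:
        return out_dir / "index.html"
    # Routes that already name a file (e.g. /js/app.js) keep that name.
    if "." in clean.rsplit("/", 1)[-1]:
        return out_dir / clean
    return out_dir / clean / "index.html"


def fetch_url(url: str) -> bytes:
    """Fetch one URL from the still-running application."""
    with urlopen(url) as response:
        return response.read()


def bake(
    base_url: str,
    routes: Iterable[str],
    out_dir: Path,
    fetch: Callable[[str], bytes] = fetch_url,
) -> list[Path]:
    """Save every listed route as a static file under out_dir."""
    written = []
    for route in routes:
        target = route_to_path(route, out_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(base_url.rstrip("/") + route))
        written.append(target)
    return written
```

With a hypothetical app on `http://localhost:8000`, calling `bake("http://localhost:8000", ["/", "/states/ny/", "/js/app.js"], Path("archive"))` would write `archive/index.html`, `archive/states/ny/index.html` and `archive/js/app.js`. In practice the route list would come from iterating over the objects in your database.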

Search is another question. If you have a small site, it might be simple enough to just disable search and use index pages (i.e., select a state, or schools that start with A), but large sites are unusable without search. So, your best options are to either rely on a third-party service like Google CSE or to perform the search in the client via JavaScript (that approach would likely involve generating an index that would have to be loaded on any page). The key to mothballing is that the final product should never connect back to your application server to do things like search or pagination. You have to assume it will be hosted on a web server that can’t run any scripts or connect to databases.
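To make the client-side option more concrete, here is a minimal sketch of generating such an index at mothballing time. It assumes you already have the text of each page in hand (the `pages` mapping is hypothetical), and the `search` function only mirrors in Python what a small script in the browser would do after loading the JSON file.

```python
import json
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict[str, str]) -> dict[str, list[str]]:
    """Build an inverted index: word -> sorted list of page URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return {token: sorted(urls) for token, urls in sorted(index.items())}


def search(index: dict[str, list[str]], query: str) -> list[str]:
    """Return the pages that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    results = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        results &= set(index.get(token, []))
    return sorted(results)


def write_index(index: dict[str, list[str]], path: str) -> None:
    """Write the index as a static JSON file for the client to load."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)
```

An index like this grows with the size of the site, which is the cost of loading it on any page; a very large archive might need to split it into several files.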

Unmarked Graves

Death is often worse for news applications, precisely because our work often stands apart on the sites that employ us. Almost every news programmer loathes their organization’s Content Management System; its codified formats and rigid workflows often feel more like strictures on our projects. And so, we do our work outside the CMS, skinning our pages so they look like the main news site while remaining architecturally apart. For instance, look at how we reported election results in 2012. That site is actually hosted on Amazon S3 and skinned to look like New York Times content. Why go through this extra work just to make it look like articles produced via the CMS in the end? In our case, controlling our own technology stack enabled us to do dynamic projects like election results that wouldn’t be possible within the CMS. Also, the CMS model for stories is a poor fit for data projects that may include many thousands of browsable pages; you just can’t and shouldn’t represent a relational database in a CMS. So, we do our work outside the bounds of the CMS, but it has a cost.

The New York Times has an advanced and bespoke CMS called Scoop that is used for composing all aspects of the New York Times website. Currently, Scoop imports articles from the print CMS that governs the physical print newspaper, but the plan is to soon invert that into a “web first” workflow where all articles are composed in Scoop before being laid out for print. Scoop is tightly integrated with the website and the newspaper. It is what web editors use to classify documents against the proper taxonomies and to rank articles on the homepage and section fronts. When stories are published, they are automatically syndicated to partners, published into the appropriate RSS feeds and added to site search. Stories also flow quickly into web search engines like Google and products like Lexis-Nexis. Of course, print articles are also distributed in a reasonably durable form to subscribers, including libraries that also receive the newspaper on microfilm. Other news organizations have different CMSes, but the general components of each infrastructure are similar: importing, syndication, indexing and archiving.

Narrative journalists rarely think about this infrastructure. It’s just there for everything they write, because everything they write goes through the CMS and there are strong archival and financial reasons to syndicate, index and archive that content for posterity. But, then there’s us data journalists. Remember, we decided to pitch our tents outside the CMS so we can build exciting and new types of interactive website experiences. Which often means that our work is invisible in this greater world. It doesn’t show up in site search. It doesn’t show up in Google News. It isn’t rankable on the homepage. Our projects look like they belong to the website, but they are also fundamentally apart and often invisible when running. When they are mothballed, they can vanish almost completely.

So, what is to be done? You need to make some friends and leave your little fiefdom:

  • Find the developers on the CMS team and talk to them.
  • If your company has indexers and archivists, talk to them too.
  • Target important aspects of the website ecosystem.
  • Figure out where to bury your projects when they’re dead.

You will likely have to tackle integration in fits and starts. Most CMSes are not monolithic, and this is actually an advantage: you may be able to add your content directly to the site search index or syndication workflows without having to interact with the core CMS software. There will likely be some strange workarounds in your future; it’d be nice if the CMS team gave you a direct API to call, but if your code breaks the CMS at 3 AM, you’re not the person who will get the wakeup call, after all. Finally, see if you can bring your content into the organization’s own infrastructure as static pages. We often build our sites on separate services like Amazon S3 or EC2, but whenever someone forgets to pay the bills for hosting, those sites will vanish. And we want them to stick around for a long time, even if they are only static versions of their earlier glory.

Rage Against the Dying of the Light

My discussion of posterity seems ludicrous when applied to web projects. Do I really think it’s possible to preserve a web interactive for a hundred or more years? Yes, I do think it’s possible, eventually. But for now, I would like to see interactives last more than even five years. It’s surprisingly hard to plan for the future of the web. For instance, this election site from 2008 seems to have held up well, but opening it on an iPad reveals that much of the site disappears if Flash is not installed. Sites based on Java fare even worse. I would like to think that modern websites based on web standards like HTML5 will better survive time’s bending sickle, but this confidence is likely misplaced. For instance, what if our page relies on a JavaScript function that’s deprecated in the future and removed soon after? What if JavaScript itself falls out of favor, supplanted by a new technology and eventually dumped by all browsers on the Galactic Hyperweb?

More likely though, our sites will fail through dissolution rather than incompatibility. Modern web pages are built from many requests: pulling HTML from the web servers, JavaScript libraries and stylesheets from content-delivery networks (CDNs), data files from other API endpoints or edge networks. All it takes is for a few of those dependencies to break and a clever example of interactivity can become unusable. Link rot is inevitable. I sometimes like to look at the New York Times’ Hyperwocky website to reminisce about the goofiness of the Web in the late 90s. The site is still readable, but its context is a shadow of itself; most of the links are dead. Serious links also die. A recent study found that 49% of links within Supreme Court opinions no longer work. In our own projects, link rot might have similarly bad effects. It might cause scripts or stylesheets to no longer load; it might make contextual links like “for more detailed analysis, click here” fail; it might even make us think our old sites work only to find out we are so very wrong.
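One modest way to see how exposed a page is to this kind of dissolution is to list everything it pulls from other hosts. The following is a minimal sketch using only Python’s standard library; the function and class names are hypothetical, and it only looks at `script`, `link`, `img` and `iframe` tags in the HTML, so anything a script requests at runtime (JSON, JSONP, API calls) will not show up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class DependencyParser(HTMLParser):
    """Collect the URLs a page loads through common asset tags."""

    ATTRS = {"script": "src", "link": "href", "img": "src", "iframe": "src"}

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def external_dependencies(html: str, page_url: str) -> list[str]:
    """Return absolute URLs of assets served from a different host."""
    parser = DependencyParser()
    parser.feed(html)
    home = urlparse(page_url).netloc
    external = set()
    for url in parser.urls:
        absolute = urljoin(page_url, url)
        host = urlparse(absolute).netloc
        # Skip same-host assets and host-less URLs such as data: URIs.
        if host and host != home:
            external.add(absolute)
    return sorted(external)
```

Every URL this returns is a dependency that someone else can break; copying those files into the archive removes that risk for the assets the sketch is able to see.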

The difficulty of this exercise is that the future remains stubbornly unpredictable. Static mothballed versions of our sites will work best for the short term. But should we be thinking further down the road and create more basic versions of our content in the hope that it’ll last longer? Should we be archiving our content in a more durable way? The Internet Archive already has been, to some extent. In addition, the Library of Congress has been heavily involved through its National Digital Information Infrastructure and Preservation Program. Much of this material is only riveting to digital archivists, but they’ve put some thought into what digital formats will age the best for posterity. Among these is a specification for encapsulating projects into a single web archive (or WARC file) that would at least ensure internal link consistency. The Internet Archive is already using these to represent crawls of sites, but it might be useful to consider this format for manually snapshotting our own sites at various newsworthy moments or for ensuring a complete archive of large interactive sites with thousands of pages. And we might consider producing “legacy” versions of our projects with minimal to no JavaScript, maps baked out into static images, etc. if we were really serious about longevity.
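To give a sense of what a WARC file looks like inside, here is a deliberately minimal sketch that writes `resource` records by hand. It is my reading of the WARC 1.0 record layout (a block of named header fields, a blank line, the payload, then two line breaks), not a substitute for an established archiving tool; the function names are hypothetical, and a real archive would normally also begin with a `warcinfo` record and often store full HTTP responses instead.

```python
import uuid
from datetime import datetime, timezone
from typing import Iterable


def warc_resource_record(uri: str, payload: bytes, content_type: str) -> bytes:
    """Build one WARC 1.0 'resource' record for a saved file."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLF pairs after the payload.
    return head + payload + b"\r\n\r\n"


def write_warc(path: str, records: Iterable[tuple[str, bytes, str]]) -> None:
    """Write (uri, payload, content_type) tuples into a single WARC file."""
    with open(path, "wb") as handle:
        for uri, payload, content_type in records:
            handle.write(warc_resource_record(uri, payload, content_type))
```

The point of the sketch is only that the format is simple enough to snapshot a mothballed site into one file, with every page recorded under the URL it used to live at.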

This may seem absurd and it probably is. If it were part of the requirements for a site that it had to be functional for 20 years after it was decommissioned, we probably wouldn’t bother. And for many light things we do like “send in your dog photos,” it would be overkill. And yet, we do also cover hard news like elections or the Olympics or serious investigative pieces. Shouldn’t we do more to ensure our work is there for future historians rather than just ceding that to whatever appeared in a newspaper the following day?

What’s Next

While I was writing this piece, I realized really quickly that I was in over my head. As a developer, I simply do not have the mental framing to think like an archivist does, and I doubt I’m alone in that regard. Looking into those websites and standards, I was confused by all the jargon. As a developer who regularly quotes technical acronyms and the Hacker Dictionary, I am aware of the irony in this. Several organizations already have defined programming style guides; maybe we should consider some archiving style guides too? This is something we could work with archival organizations to develop, and as Matt Waite’s piece shows (if you haven’t read it by now, please do so), it’s a lot easier to plan for posterity in advance than when the project has ended and people have moved on to other things.

Due to the varying capabilities of different web browsers, web designers early on learned to code their sites to support graceful degradation, where the app regresses to a more limited but still usable state if certain functionality is not available. This has since been supplanted by the concept of progressive enhancement, where sites are designed to work for a baseline first and functionality is added for more advanced browsers that can support it. These concepts may seem similar in execution, but they are derived from different philosophies and assumptions about how users will upgrade their browsers or what those browsers support. For instance, the rise of mobile devices negated the assumption in graceful degradation that browsers will get faster and more advanced with time. Thinking about our sites in terms of degradation or enhancement seems like an excellent basis for future compatibility. Will there be a time when we can assume that browsers are much faster, but they lack compatibility for some of the standards we take for granted today?

We will also need to build tools. Django already has the excellent Django Bakery plugin for baking out dynamic sites into static pages, but there is no equivalent solution for Ruby on Rails or some other web frameworks. We also need better tools for verifying that web archives are internally consistent and not missing any files including stylesheets or JSON loaded by scripts. It’s not glamorous work, but it’s specific and well-suited for well-organized minds who have the methodical skills I personally lack.
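As a starting point for the verification tool described above, here is a minimal sketch that walks a directory of baked-out HTML and reports local references that point at files missing from the archive. The names are hypothetical, and it only follows references written into the HTML itself (`a`, `link`, `script`, `img`), so it would not catch JSON that scripts load at runtime, which is exactly the harder part of the problem.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlparse


class ReferenceParser(HTMLParser):
    """Collect href/src values from links, stylesheets, scripts and images."""

    ATTRS = {"a": "href", "link": "href", "script": "src", "img": "src"}

    def __init__(self) -> None:
        super().__init__()
        self.refs: list[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.refs.append(value)


def missing_files(root: Path) -> list[tuple[str, str]]:
    """Return (page, reference) pairs whose local target does not exist."""
    missing = []
    for page in sorted(root.rglob("*.html")):
        parser = ReferenceParser()
        parser.feed(page.read_text(encoding="utf-8"))
        for ref in parser.refs:
            parsed = urlparse(ref)
            # Skip external URLs, mailto: links and fragment-only links.
            if parsed.scheme or parsed.netloc or not parsed.path:
                continue
            # Root-relative paths resolve against the archive root.
            base = root if parsed.path.startswith("/") else page.parent
            target = base / unquote(parsed.path).lstrip("/")
            if target.is_dir():
                target = target / "index.html"
            if not target.exists():
                missing.append((page.relative_to(root).as_posix(), ref))
    return missing
```

Running `missing_files(Path("archive"))` on a mothballed site should ideally return an empty list; anything it does return is a page that will quietly break once the original server is gone.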

This article is part of a Guide: The Care & Feeding of News Apps
