Introducing PourOver and Tamper
Client-side superfast collection management from the NYT
This project was documented and released as part of the first OpenNews Code Convening.
Today we’re open-sourcing two internal projects from The Times:
- PourOver.js, a library for fast filtering, sorting, updating and viewing large (100k+ item) categorical datasets in the browser, and
- Tamper, a companion protocol for compressing categorical data on the server and decompressing in your browser. We’ve achieved a 3–5x compression advantage over gzipped JSON in several real-world applications.
We invite you to explore the docs and examples for both projects; we also have more examples over on the Times’ Open blog. What follows is the story of the genesis and development of these projects:
Collections are important to developers, especially news developers. We are handed hundreds of user-submitted snapshots, thousands of archive items, or millions of medical records. Filtering, faceting, paging, and sorting through these sets are the shortest paths to interactivity, direct routes to experiences which would have been time-consuming, dull or impossible with paper, shelves, indices, and appendices. But we don’t have many good patterns or libraries for dealing with these collections in the browser. We fall back to the array, the linked list or the set. Though these collections of categorical data (every item has m of n possible values) have a special structure, we treat them like collections we know nothing about. We inefficiently filter by looping a function over every item on every action, re-sorting every time. We write a new “pageRight”, “pageLeft”, “selectByColor” for every new project. Other times, we simply defer the responsibility of collection management to the server, to the backend app. We translate user actions into SQL queries, get a bunch of ids back, cull, and re-render.
Dissatisfied with this state of the art, we made PourOver as an attempt to standardize an efficient and extensible model of client-side collection management and weaken reliance on server-side collection operations. Even on modern networks with beefy machines, the roundtrip to a backend is irredeemably slow for responsive UIs. Average North American latency is about 40 ms. Any action that calls out for a server operation reduces the framerate of our application to 25 fps, at best. A moderately slow query or heavy render action drags the framerate down into the low teens. We firmly believe that if an app or UI doesn’t feel responsive, if an app doesn’t function in the 30–60 fps range, its use begins to feel like a chore. Users aren’t encouraged to explore when every manipulation triggers a half-second pause. With PourOver, the server-trip bottleneck is gone because collection operations are done on the client. The hardest limitation becomes render speed, which is much simpler to improve upon than the latency of the internet.
The genesis of PourOver is found in the 2012 London Olympics. Editors wanted a fast, online way to manage the half a million photos we would be collecting from staff photographers, freelancers, and wire services. Editing just hundreds of photos can be difficult with the mostly-unimproved, offline solutions standard in most newsrooms. Editing hundreds of thousands of photos in real-time is almost impossible. To give our Olympics editors more power, we created a service called Imago that categorized incoming photos and surfaced an in-browser solution for selection, editing, publishing, and slideshow creation.
Of course, the enterprise was a small disaster, salvaged only by a month of 19-hour workdays, some bizarre collection query language bolted on to Backbone, and a custom SQL query generator. Our client application could blaze through photos and construct infinitely complex facets. The server, however, slogged along under the weight of uncacheable paging and sorting. In the end, the code had knotted itself into domain-specific jury-rigging. When it came time to build a day-to-day photo editing platform, named MOD, we had to throw all of Imago’s code away.
PourOver is an attempt to abstract out the difficult, collection-specific problems encountered in both MOD and Imago: How can you index filtering and sorting operations? How do you compose the results of this indexing so that every re-render doesn’t require re-querying? If your collection dynamically updates, how do you recalculate your filtering results, your sorting, and what page you are on? What kind of event system gives you enough hooks to respond to complicated changes in filter and sort state? How do you combine all of these enhancements with data that must be lazily loaded (e.g. full-text captions for 500,000 photos)? PourOver exposes some of those solutions.
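To make the indexing question concrete, here is a minimal sketch of the general idea: build a lookup from each attribute value to the ids that carry it, then answer combined queries by intersecting id lists instead of re-running a predicate over every item. This is not PourOver's API; the data and the `buildIndex` and `intersect` helpers are hypothetical names invented for this illustration.

```javascript
// Hypothetical sketch of indexed filtering. Not PourOver's API.
const items = [
  { id: 0, color: "red", year: 2012 },
  { id: 1, color: "blue", year: 2013 },
  { id: 2, color: "red", year: 2013 },
  { id: 3, color: "green", year: 2012 },
];

// Build the index once: for each attribute value, the ascending list of matching ids.
function buildIndex(items, attr) {
  const index = new Map();
  for (const item of items) {
    const value = item[attr];
    if (!index.has(value)) index.set(value, []);
    // The sample items are stored in id order, so each list stays sorted.
    index.get(value).push(item.id);
  }
  return index;
}

// Intersect two ascending id lists in a single linear pass.
function intersect(a, b) {
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(a[i]);
      i++;
      j++;
    } else if (a[i] < b[j]) {
      i++;
    } else {
      j++;
    }
  }
  return out;
}

const byColor = buildIndex(items, "color");
const byYear = buildIndex(items, "year");

// A combined query is now two lookups plus a merge, with no per-item predicate call.
const redIn2013 = intersect(byColor.get("red") ?? [], byYear.get(2013) ?? []);
console.log(redIn2013); // [ 2 ]
```

Because the per-value id lists are computed once and reused, toggling a filter only costs a merge over the lists involved, which is the property that keeps re-renders from triggering a full re-query.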
But even if we assume that we have an efficient, sensible library for working with collections of categorical data, how do you get that data to the browser?
The classic arc goes something like this:
- the app was great on localhost (http://127.0.0.1/) with 10 items in the database
- it was awesome on the prod server with an extra-large RDS, even though there were 1k items
- but suddenly! we got a lot of submissions/posts/images and now—the app is sluggish and the data file has grown to 1.5MB
- and it won’t load at all on my phone connected to the Times Square 4G network.
Over the years we’ve tried a variety of optimizations. Being vigilant about gzipping data files was a great start. Partitioning data into a “bootstrap.json” required for initial load and a complete “all.js” accelerated first render. For database-backed apps, we set up paginated JSON APIs so no individual request was ever too large.
But these improvements weren’t without problems. MySQL LIMIT and OFFSET pagination (employed by the major Ruby pagination libraries) is impossible to scale. Once we broke items across multiple pages, we had to devise other ways of describing aggregates—and then needed yet more endpoints to support complex queries (e.g. “state = ‘CT’ and population < 6000”).
As we developed PourOver and began to realize the power of client-side queries, it became clear that we couldn’t paginate; we needed a way to send the full dataset to the client.
We started pursuing two angles:
- progressive loading, so that the initial load only contained essential details and other data could be lazy-loaded
- a more compact data encoding
Most PourOvers with large datasets take advantage of both.
The essential idea is to separate categorical attributes—which are necessary to build queries and compress easily due to their limited dictionary—from freeform attributes that are more difficult to compress. The initial Tamper load is an encoding of all categorical attributes for all items; freeform, or “buffered,” attributes are only loaded as necessary.
For example, we might load state and population data for all cities, but buffer their name. This way if you filter to “Florida,” you only need to transfer ~400 city names rather than 30,000.
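The cities example can be sketched roughly as follows. Everything here is an assumption made for illustration: the sample data, the `fetchNames` stand-in for a server endpoint, and the `namesForState` helper are hypothetical and are not part of PourOver or Tamper. The fetch is synchronous only to keep the sketch short; a real one would be asynchronous.

```javascript
// Hypothetical sketch of buffered (lazily loaded) attributes.

// Loaded up front: compact categorical attributes for every city.
const cities = [
  { id: 0, state: "FL" },
  { id: 1, state: "CT" },
  { id: 2, state: "FL" },
];

// Stand-in for the server's store of freeform attributes (placeholder names).
const serverNames = { 0: "City A", 1: "City B", 2: "City C" };

// Records which ids were requested, to show how little gets transferred.
const requested = [];
function fetchNames(ids) {
  requested.push(...ids);
  return new Map(ids.map((id) => [id, serverNames[id]]));
}

const nameCache = new Map();

// Filter on the categorical data already in memory, then fetch only the missing names.
function namesForState(state) {
  const ids = cities.filter((c) => c.state === state).map((c) => c.id);
  const missing = ids.filter((id) => !nameCache.has(id));
  if (missing.length > 0) {
    for (const [id, name] of fetchNames(missing)) nameCache.set(id, name);
  }
  return ids.map((id) => nameCache.get(id));
}

console.log(namesForState("FL")); // [ 'City A', 'City C' ]
console.log(requested); // [ 0, 2 ]
```

The filter itself never touches the network; only the names of the items that survive the filter are requested, and repeated queries are served from the cache.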
Maximizing Encoding Efficiency
Even with variable attributes factored out, categorical JSON for large datasets can be heavy.
JSON is a strictly character-based protocol: “true”, which could be represented by a binary ‘1’, is serialized as [‘t’, ‘r’, ‘u’, ‘e’]. Though gzip will collapse these repeated tokens, each backreference will almost always cost more than a single bit.
The big idea behind Tamper is to use the most efficient encoding based on a categorical attribute’s limited possibility space. In the state example, we know that there are 50 possibilities; therefore we can represent each choice in 6 bits or fewer (50 in binary is 110010). This is similar in concept to the varints used in Google Protocol Buffers, but without the requirement that each entry occupy a whole number of bytes. For a four-possibility space (representable in 3 bits), we can encode 1,000 items in 375 bytes; varints would require 1 KB, at one byte per item.
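The arithmetic above can be checked with a small bit-packing sketch. It is an illustration of fixed-width packing in general, not Tamper's actual wire format; `bitsFor`, `pack`, and `unpack` are hypothetical helpers. `bitsFor` follows the counting used in the text, the length of n written in binary, which is an assumption about how the widths were derived.

```javascript
// Sketch of fixed-width bit packing. Tamper's real format is not reproduced here.

// Bits per item under the text's counting: the length of n written in binary.
function bitsFor(possibilities) {
  return possibilities.toString(2).length;
}

// Pack non-negative integers into bytes, `width` bits each, most significant bit first.
function pack(values, width) {
  const bytes = new Uint8Array(Math.ceil((values.length * width) / 8));
  let bitPos = 0;
  for (const value of values) {
    for (let b = width - 1; b >= 0; b--) {
      if ((value >> b) & 1) bytes[bitPos >> 3] |= 0x80 >> (bitPos & 7);
      bitPos++;
    }
  }
  return bytes;
}

// Read `count` integers of `width` bits each back out of the byte array.
function unpack(bytes, width, count) {
  const values = [];
  let bitPos = 0;
  for (let i = 0; i < count; i++) {
    let value = 0;
    for (let b = 0; b < width; b++) {
      const bit = (bytes[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
      value = (value << 1) | bit;
      bitPos++;
    }
    values.push(value);
  }
  return values;
}

// 1,000 items drawn from a four-possibility space, numbered 1 through 4.
const values = Array.from({ length: 1000 }, (_, i) => (i % 4) + 1);
const width = bitsFor(4);
const packed = pack(values, width);

console.log(bitsFor(50), width, packed.length); // 6 3 375
```

Numbering the choices from 1 leaves 0 free, for instance for a missing value, which is one reading of why four possibilities are counted as 3 bits; if the choices were numbered from 0, two bits per item would suffice.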
Base64 encoding the binary data and gzipping the result further compresses the size. In addition, there are efficiencies we can take advantage of when encoding the contiguous GUID space.
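As a rough sketch of that last step, assuming Node.js and its built-in `zlib` module, the round trip from packed bytes to Base64 text to gzip and back looks like this. The byte values are arbitrary stand-ins for bit-packed data, and nothing here reflects Tamper's actual framing.

```javascript
// Sketch assuming Node.js. The bytes are arbitrary stand-ins for bit-packed data.
const zlib = require("zlib");

const packed = Uint8Array.from({ length: 375 }, (_, i) => (i * 37) % 256);

// Server side: binary -> Base64 text (safe to embed in JSON) -> gzip.
const base64 = Buffer.from(packed).toString("base64");
const wire = zlib.gzipSync(base64);

// Receiving side: gunzip -> Base64 decode -> bytes.
const text = zlib.gunzipSync(wire).toString("ascii");
const decoded = new Uint8Array(Buffer.from(text, "base64"));

// Print the three sizes to compare raw, Base64, and gzipped Base64.
console.log(packed.length, base64.length, wire.length);
```

Base64 by itself inflates the payload by a third (375 bytes become 500 characters), so how much gzip wins back depends on how repetitive the packed data is. In a browser, the gzip step is normally handled by HTTP content encoding rather than by application code.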
Finally, It’s Open-Sourced
PourOver and Tamper have been in active development for over a year here at the Times. For many months we’ve been planning an open-source release; last week’s OpenNews Code Convening finally gave us a chance to organize the repos, finalize docs and generate examples. We look forward to seeing how these libraries may be useful to you, too.
In the Wild
Our first-ever deployment of PourOver was the 2013 edition of the Red Carpet Project. In previous years, the red carpet roundup employed a more traditional approach to filtering where each filter click reevaluated all filters for all items. But for 2013, our Fashion desk wanted to expand to include all looks in the past decade, and we knew that if we wanted to scale the number of items as well as the complexity of the filtering, we’d need to optimize query resolution. PourOver handily achieved 60 fps.
From there, PourOver expanded into other categorical applications including:
Even when a piece isn’t interactively filterable, we’ve found PourOver useful in abstracting collection and page management. Our responsive modal, for example, employs PourOver to manage page state. As a bonus, this makes it very easy to add a modal atop a filtered collection, or to interactively add or remove items. In the case of our Live Oscars coverage, PourOver saved us from writing yet another live poller and collection state manager.
The Tamper protocol has to date only been implemented in internal tools, but soon we hope to release some public applications.