The Washington Post’s automated screenshotter grows up
As part of the OpenNews Code Convening held earlier this month, we’re releasing Whippersnapper—an automated screenshot tool to keep a visual history of content on the web. It builds on top of other open source projects to capture and upload screenshots of a web page, giving users creative power to track how the internet visually changes.
We built an early version of Whippersnapper as part of our midterm elections coverage at the Washington Post. Election-night applications can be volatile, with rapidly changing news and surges in traffic. We knew no matter how much time and thought went into our election results infrastructure, something could still go wrong.
Our solution? Create a simplified version of our election night maps that pointed to a static version of our in-house API, and automate the process of screenshotting and uploading those map images. Although we had more sophisticated backup systems in place, this tool would ensure that we would have a live results map even in the worst-case scenario.
While the tool was originally conceived as a last-ditch backup utility, we think it has value beyond election contingency planning. For election night, we considered providing the map images to reporters writing follow-up pieces so they could describe the play-by-play as polls closed and the maps gained color.
Outside of elections, the tool is a simple means of showing how change occurs on the internet—more efficient and precise than manually taking screenshots. It can be used similarly to PastPages, but run on a custom time interval and pointed at any location on the internet.
Whippersnapper could also be used to provide static image versions of computationally expensive interactive graphics, serving these images up to older browsers and low-powered mobile devices.
What we did
Two weeks later, OpenNews gave us the opportunity to release this tool out in the open. While we already had a complete, functioning piece of code, we knew it would take an overhaul to make it useful beyond its original purpose. To do so, we focused on the following things:
- Reassessing the tool’s goal. We removed all references to elections, not wanting to limit the use cases for the tool. We also made the tool more configurable, adding support for multiple screenshot targets and options such as skipping the upload to S3.
- Refactoring the code. We rewrote sections of the code to be more DRY (Don’t Repeat Yourself) and modular, rediscovering the powers of Python, a sensible language we don’t use enough in our day-to-day work. We also scrutinized the names of our configuration options, hoping to make them as intuitive as possible.
- Writing documentation. In addition to documenting the tool’s options, we provided detailed installation instructions and asked several people to test installing and using the package. We included a few sample configurations to help get people started with the tool.
- Fixing bugs. If the tool was sent to an invalid page, it would take a screenshot of the 404 error message. This turned out to be a bug in Depict—it wasn’t checking the response code of its target page. We discovered another Depict bug that prevented the tool’s delay feature from working properly.
In two days, we turned a quickly written, single-use script into a powerful tool that likely has uses we haven’t thought of.
How it works
Whippersnapper takes a set of web pages and target CSS selectors, defined in a configuration file. In the simplest case, it cycles through that set, repeatedly capturing images of the targets. The tool saves the images with the current time in the filename, allowing users to revisit them in the future.
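A configuration along these lines might look like the sketch below. The field names here are illustrative, not necessarily Whippersnapper’s documented schema:

```yaml
# Hypothetical Whippersnapper configuration -- field names are
# illustrative, not the tool's actual documented schema.
targets:
  - url: http://example.com/elections/2014
    selector: "#results-map"
  - url: http://example.com/elections/2014/senate
    selector: ".senate-results"
interval: 300        # seconds between capture cycles
upload_to_s3: false  # skip the S3 upload for local-only use
```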
For our usage on election night, we needed to upload these images somewhere on the internet. Given the proper keys, Whippersnapper can store the images on Amazon S3—including a “latest” version, which displays the most recent image snapshot at a fixed URL. On election night we set our homepage to display these latest images (with cache-busting tokens), making it simple to always show users the most current version of the maps.
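The timestamped-plus-“latest” scheme can be sketched in a few lines of Python. The key layout and the query-string token are assumptions about how such a setup might work, not Whippersnapper’s exact implementation:

```python
from datetime import datetime, timezone
from time import time

def screenshot_keys(slug, ext="png"):
    """Build a pair of hypothetical S3 keys for one screenshot: a
    timestamped copy for the historical record, plus a fixed
    'latest' key that always points at the most recent capture."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{slug}/{stamp}.{ext}", f"{slug}/latest.{ext}"

def cache_busted_url(base_url):
    """Append a cache-busting token so browsers refetch the
    'latest' image instead of serving a stale cached copy."""
    return f"{base_url}?t={int(time())}"
```

Because the “latest” key never changes, a homepage can hardcode one URL per map and still always show the newest image.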
Whippersnapper is designed to be a long-running process. It is written in Python and is essentially a script that repeatedly invokes other programs with the right arguments. It depends heavily on Depict, a tool that takes a single screenshot of part of a web page, which in turn relies on PhantomJS, a headless browser useful for all kinds of automation. (We briefly considered naming the tool Turducken, given how many layers it wraps.)
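The “script that repeatedly opens other programs” shape can be sketched as a simple subprocess loop. The positional URL/selector/outfile order passed to `depict` is an assumption about its command line, not its documented interface:

```python
import subprocess
import time

def build_depict_command(url, selector, outfile):
    """Assemble the argument list for one Depict invocation.
    The argument order here is an assumption about Depict's CLI,
    not its documented interface."""
    return ["depict", url, selector, outfile]

def capture_loop(targets, interval, run_once=False):
    """Cycle through (url, selector, outfile) targets, shelling
    out to Depict for each capture, then sleep and repeat."""
    while True:
        for url, selector, outfile in targets:
            subprocess.run(build_depict_command(url, selector, outfile))
        if run_once:
            break
        time.sleep(interval)
```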
Users can run the tool locally or on a server. For our midterm election backup, we ran the tool under Upstart, an init system that manages jobs and can handily restart them if they stop.
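An Upstart job for this kind of supervision might look like the sketch below. The file path and launch command are hypothetical; only the Upstart stanzas themselves (`start on`, `respawn`, `exec`) are standard:

```
# /etc/init/whippersnapper.conf -- hypothetical Upstart job;
# the exec line below is illustrative, not the tool's actual CLI.
description "Whippersnapper screenshot loop"
start on runlevel [2345]
stop on runlevel [016]
respawn
exec python /opt/whippersnapper/run.py
```

The `respawn` stanza is what gives the restart-on-failure behavior: if the long-running process dies, Upstart relaunches it.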
The Code Convening gave us a great opportunity to get feedback on Whippersnapper and how it could be used. Brian Brennan, a developer who was helping out at the event, suggested that the tool check whether the target web page has changed before uploading a screenshot. We’re hoping to add this feature in the interest of efficiency and preserving disk space.
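One possible way to implement that check is to hash each new capture and compare it against the digest of the last uploaded one, skipping the upload when nothing has changed. This is a sketch of the idea, not a feature Whippersnapper currently has:

```python
import hashlib

def has_changed(image_bytes, previous_digest):
    """Return (changed, digest) for a new screenshot. If the
    SHA-256 digest matches the previous upload's digest, the
    capture is unchanged and the upload can be skipped."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    return digest != previous_digest, digest
```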
We’d also like to explore more creative uses for Whippersnapper and how to make the tool more configurable for those uses. If you have ideas for how you can use Whippersnapper—as a reporting tool, backup system or something else—let us know.
Developer at The Marshall Project, previously at NPR Visuals and Washington Post Graphics. Based in Washington, D.C.
Kevin Schaul is a graphics editor at the Washington Post. He graduated from the University of Minnesota with a degree in computer science, though all his professional work has been in newsrooms. He grew up in the Windy City suburbs of Gurnee and Lake Forest, where he developed a deep love for Chicago sports and deep dish pizza. In his free time, Kevin dabbles in photography and distance running.