How to Save DNAInfo/Gothamist Bylines
What we know so far about rescuing the destroyed archives of local reporting
UPDATE: A source close to Gothamist/DNAInfo reports that the sites’ data has been preserved, and that the sites’ publisher has been in contact with the Internet Archive about maintaining a long-term archive of the years of journalism currently inaccessible on Gothamist and related sites. Asked about the multi-site content takedown, our source said, “They didn’t think about it. The directive was to send every page to this statement.”
A reporter at the NYT is now confirming that the archives will be handled at some point in the coming weeks, according to a DNAInfo spokesperson. Until the promised archives are made available—and ideally the original sites preserved at their URLs for at least a temporary period—we recommend continuing to back up data, but writers for the downed sites can probably relax at least a little bit and get down to the business of abruptly finding new jobs. We’ll continue to collect information here as we find it.
The owner of the DNAInfo and Gothamist family of local news websites shut the sites down today, which means that not only are all their 115 journalists out of work, but all their bylines—and all the vital information in their years of reporting—are gone.
If you’re a newly fired reporter or a pissed-off reader, here’s the advice we’ve collected so far on salvaging as much data as can be saved.
The sites have been archived at Archive.org. Here are the main links:
As someone who’s done a manual scrape of Archive.org data to try to save clean copies of a journalist’s work after some other shitty publisher destroyed their archives, I can tell you that getting the data off is a giant pain, so this is a worthwhile group effort.
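If you want to do that group effort programmatically rather than clicking through the Wayback Machine page by page, the Internet Archive exposes a CDX index you can query for every capture it holds under a domain. A minimal sketch of building such a query (the endpoint and parameter names follow the public Wayback CDX API; the `limit` value here is just an example):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(site, limit=1000):
    """Build a Wayback Machine CDX API query listing archived captures
    for every URL under `site`. The API returns JSON rows of
    [urlkey, timestamp, original, mimetype, statuscode, digest, length]."""
    params = {
        "url": site + "/*",          # wildcard: every page under the domain
        "output": "json",
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # one row per unique URL
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)

print(cdx_query_url("gothamist.com"))
```

Fetching that URL gives you a machine-readable list of everything archived, which you can then feed into any of the download tools below.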
NEW: The Gothamist Archive Retrieval Tool is a super-easy scraping tool that works for any Gothamist-related site and is awesome—just enter a byline and it snags all the Google AMP cached articles credited to that writer.
🚨🚨🚨🚨 @xn9q8h and i wrote a tool that retrieves Gothamist articles from AMP caches! 🚨🚨🚨🚨 https://t.co/tPMBGMVFSk pic.twitter.com/blzWcSLbvf— 😈 (@turtlekiosk) November 3, 2017
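The trick their tool relies on is that Google's AMP cache keeps its own copy of AMP articles at a predictable address derived from the original URL. A sketch of that mapping (the `cdn.ampproject.org/c/` content path and the `s/` HTTPS marker follow the documented AMP cache URL scheme; treat this as an approximation, not the tool's actual code):

```python
from urllib.parse import urlsplit

def amp_cache_url(article_url):
    """Map an article URL to its copy on the Google AMP cache.
    Shape: https://cdn.ampproject.org/c/[s/]host/path, where "c"
    means a content (HTML) document and "s" marks an HTTPS origin."""
    parts = urlsplit(article_url)
    scheme_marker = "s/" if parts.scheme == "https" else ""
    path = parts.path.lstrip("/")
    return f"https://cdn.ampproject.org/c/{scheme_marker}{parts.netloc}/{path}"

print(amp_cache_url("http://gothamist.com/2017/11/02/some_story.php"))
```

The upside of the cache over Google's regular page cache is that it serves the article markup itself, so a byline search can pull clean copies in bulk.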
NEW: Rhizome’s Michael Connor has made a tutorial for extracting the missing stories with their media using Webrecorder.io.
A quick tutorial from @michael_connor to download & extract high fidelity Gothamist content using @internetarchive and @webrecorder_io https://t.co/zR6ATUfgk3— Webrecorder.io (@webrecorder_io) November 3, 2017
NEW: Paul Ford has come through with a Gothamist-only spreadsheet of linked archived articles—57,000 saved so far.
Thread. I started a list of articles from Gothamist with links to the Internet Archive, 12k so far. https://t.co/XwbQkUA0DV (cont'd)— Paul Ford (@ftrain) November 3, 2017
Emily Crockett has a running thread on getting the data out:
Journalists at @Gothamist @DNAinfo @DCist @Chicagoist @LAist: follow these steps to preserve your own archives:— Emily Crockett (@emilycrockett) November 2, 2017
NEW: Kate-Laurel at Signal has very kindly dropped some bash scripts into the comments below for folks who are downloading.
NEW: Jeremy Singer-Vine’s Waybackpack is a command-line tool that lets you download the entire Wayback Machine archive for a given URL. Super useful in this situation.
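Under the hood, every snapshot Waybackpack downloads lives at a predictable Wayback Machine address, so you can also fetch individual captures directly once you have their timestamps from the CDX index. A sketch of that address format (the 14-digit timestamp and the `id_` raw-capture flag are part of the standard Wayback URL scheme):

```python
def snapshot_url(timestamp, original_url, raw=True):
    """Build the Wayback Machine address for one capture.
    `timestamp` is the 14-digit YYYYMMDDhhmmss string from the CDX
    index; appending "id_" requests the original bytes with no
    archive toolbar or link rewriting injected."""
    flag = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{flag}/{original_url}"

print(snapshot_url("20170101000000", "http://gothamist.com/"))
```

The raw (`id_`) form is what you want if you're trying to reconstruct clean copies of articles rather than screenshots of the archive's wrapper.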
There's a Firefox extension called Scrapbook I save all clips on, it retains formatting and images: https://t.co/6BJn4xblKd— (possibly hollow) 🥀 (@waywardfun) November 2, 2017
Is this at all useful?:https://t.co/I62Dm4VcRW— Rainbow Dash Warrior (@XtinaSchelin) November 2, 2017
It requires Ruby, which I don't have, but someone else might be able to make use of it.
For photos on @Gothamist sites, use older versions: 2013 has them https://t.co/Loh2Ucwq9C 2017 doesn’t https://t.co/BA6iU2Vv96— Steve Rhodes (@tigerbeat) November 2, 2017
Google cache isn't reliable. Takedowns can occur by request w/in 24 hrs. Easy way to save multiple pgs: curl. Howto: https://t.co/Esa7VK0VTR https://t.co/Trb7oQgoWn— Robin Stuart (@rcstuart) November 2, 2017
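The curl approach above amounts to looping over a list of URLs (from the shared spreadsheet, say) and writing each page to a local file. For anyone without curl handy, here's a rough standard-library Python equivalent; the flat-filename scheme is my own invention, not anything the sites used:

```python
import re
from urllib.request import urlopen

def local_filename(url):
    """Turn an article URL into a safe flat filename, e.g.
    http://gothamist.com/2017/11/02/story.php ->
    gothamist.com_2017_11_02_story.php (scheme stripped,
    path separators collapsed to underscores)."""
    name = re.sub(r"^https?://", "", url)
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")

def save_page(url):
    """Fetch one page and write it to the current directory --
    the Python analogue of `curl -o <file> <url>`."""
    with urlopen(url) as resp:
        with open(local_filename(url), "wb") as f:
            f.write(resp.read())

print(local_filename("http://gothamist.com/2017/11/02/story.php"))
```

As the tweet notes, point this at Wayback Machine capture URLs rather than Google's cache, since cached copies can disappear on short notice.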
Natalie Grybauskas from de Blasio’s press office has a bunch of clips stashed in emails, and Cory Epstein is running a shared Google Spreadsheet where people can add URLs of lost stories so they can be extracted from the Internet Archive.
We’ll update the story as we get more info on saving the lost articles, although nothing replaces the websites themselves.
Editor, Source, 2012-2018.