How to Save DNAInfo/Gothamist Bylines

What we know so far about rescuing the destroyed archives of local reporting

Posted on: November 2, 2017

UPDATE: A source close to Gothamist/DNAInfo reports that the sites’ data has been preserved, and that the sites’ publisher has been in contact with the Internet Archive about maintaining a long-term archive of the years of journalism currently inaccessible on Gothamist and related sites. Asked about the multi-site content takedown, our source said, “They didn’t think about it. The directive was to send every page to this statement.”

A reporter at the NYT is now confirming that the archives will be handled at some point in the coming weeks, according to a DNAInfo spokesperson. Until the promised archives are made available (and ideally the original sites preserved at their URLs for at least a temporary period), we recommend continuing to back up data, but writers for the downed sites can probably relax at least a little bit and get down to the business of abruptly finding new jobs. We’ll continue to collect information here as we find it.

The owner of the DNAInfo and Gothamist family of local news websites shut the sites down today, which means that not only are all 115 of their journalists out of work, but all their bylines, and all the vital information in their years of reporting, are gone.

If you’re a newly fired reporter or a pissed-off reader, here’s the advice we’ve collected so far on salvaging as much data as can be saved.

The sites have been archived at Archive.org. Here are the main links:

DNAInfo—https://web.archive.org/web/*/DNAINFO.com/
Gothamist—https://web.archive.org/web/*/gothamist.com/
LAist—https://web.archive.org/web/*/LAist.com/
Chicagoist—https://web.archive.org/web/*/chicagoist.com/
DCist—https://web.archive.org/web/*/DCist.com/
SFist—https://web.archive.org/web/*/SFist.com/
Shanghaiist—https://web.archive.org/web/*/shanghaiist.com/
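If clicking through those calendar pages by hand gets old, the Wayback Machine also exposes a CDX search endpoint that lists captures as data. Here is a minimal sketch, assuming the public CDX endpoint and its JSON output (a header row followed by one row per capture). The function names and the default limit are ours, picked for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(site, limit=50):
    """Build a CDX query for successful captures under a domain."""
    params = {
        "url": site,
        "matchType": "domain",        # include subdomains and all paths
        "output": "json",
        "fl": "timestamp,original",   # only the fields we need
        "filter": "statuscode:200",   # skip redirects and errors
        "collapse": "urlkey",         # one row per distinct URL
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def rows_to_snapshot_urls(rows):
    """Turn CDX JSON rows (first row is the header) into playback URLs."""
    if not rows:
        return []
    header, *records = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]


def list_snapshots(site, limit=50):
    """Fetch and parse one page of results (this makes a network request)."""
    with urlopen(build_cdx_url(site, limit), timeout=30) as resp:
        body = resp.read().decode("utf-8")
    return rows_to_snapshot_urls(json.loads(body) if body.strip() else [])

# Example: list_snapshots("gothamist.com", limit=10)
```

Keep the limit small while testing; a full-domain listing for a site this size can be very large.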

As someone who’s done a manual scrape of Archive.org data to try to save clean copies of a journalist’s work after some other shitty publisher destroyed their archives, I can tell you that getting the data off is a giant pain, so this is a worthwhile group effort.

NEW: The Gothamist Archive Retrieval Tool is a super-easy scraping tool that works for any Gothamist-related site and is awesome—just enter a byline and it snags all the Google AMP cached articles credited to that writer.

🚨🚨🚨🚨 @xn9q8h and i wrote a tool that retrieves Gothamist articles from AMP caches! 🚨🚨🚨🚨 https://t.co/tPMBGMVFSk pic.twitter.com/blzWcSLbvf
— 😈 (@turtlekiosk) November 3, 2017
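For the curious, here is roughly how a Google AMP Cache address is put together from a publisher's AMP page URL. This is a simplified sketch based on the documented cache URL format, not the retrieval tool's actual code, and it only handles plain ASCII domains (the full rules also cover internationalized and very long hostnames). The example URL is hypothetical.

```python
from urllib.parse import urlparse


def amp_cache_url(amp_url):
    """Map an AMP page URL to its (simplified) Google AMP Cache URL."""
    parsed = urlparse(amp_url)
    host = parsed.netloc.lower()
    # Existing dashes are doubled first, then dots become single dashes.
    subdomain = host.replace("-", "--").replace(".", "-")
    # "/c/" marks a content document; "s/" is added for https origins.
    secure = "s/" if parsed.scheme == "https" else ""
    query = "?" + parsed.query if parsed.query else ""
    return f"https://{subdomain}.cdn.ampproject.org/c/{secure}{host}{parsed.path}{query}"

# Example (hypothetical page):
# amp_cache_url("https://example-site.com/amp/story")
```

Whether a given article is still in the cache is a separate question; the cache only holds pages that were published as valid AMP and recently requested.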

NEW: Rhizome’s Michael Connor has made a tutorial for extracting the missing stories with their media using Webrecorder.io.

A quick tutorial from @michael_connor to download & extract high fidelity Gothamist content using @internetarchive and @webrecorder_io https://t.co/zR6ATUfgk3
— Webrecorder.io (@webrecorder_io) November 3, 2017

NEW: Paul Ford has come through with a Gothamist-only spreadsheet of linked archived articles—57,000 saved so far.

Thread. I started a list of articles from Gothamist with links to the Internet Archive, 12k so far. https://t.co/XwbQkUA0DV (cont'd)
— Paul Ford (@ftrain) November 3, 2017

Emily Crockett has a running thread on getting the data out:

Journalists at @Gothamist @DNAinfo @DCist @Chicagoist @LAist: follow these steps to preserve your own archives:
— Emily Crockett (@emilycrockett) November 2, 2017

More tips:

NEW: Kate-Laurel at Signal has very kindly dropped some bash scripts into the comments below for folks who are downloading.

NEW: Jeremy Singer-Vine’s Waybackpack is a command-line tool that lets you download the entire Wayback Machine archive for a given URL. Super useful in this situation.

There's a Firefox extension called Scrapbook I save all clips on, it retains formatting and images: https://t.co/6BJn4xblKd
— (possibly hollow) 🥀 (@waywardfun) November 2, 2017

Is this at all useful?:https://t.co/I62Dm4VcRW
It requires Ruby, which I don't have, but someone might be able to utilize it.
— Rainbow Dash Warrior (@XtinaSchelin) November 2, 2017

For photos on @Gothamist sites, use older versions: 2013 has them https://t.co/Loh2Ucwq9C 2017 doesn’t https://t.co/BA6iU2Vv96
— Steve Rhodes (@tigerbeat) November 2, 2017
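To hunt for an older capture like the 2013 one mentioned above without paging through the calendar, you can ask the Wayback Machine's availability endpoint for the snapshot closest to a date. A small sketch, assuming that endpoint's documented JSON shape; the function names and the default timestamp are our own choices.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp):
    """Build a lookup for the capture closest to timestamp (YYYYMMDD)."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode(
        {"url": page_url, "timestamp": timestamp}
    )


def closest_snapshot(payload):
    """Return the closest capture's URL from a response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


def find_capture_near(page_url, timestamp="20130101"):
    """Query the endpoint (this makes a network request)."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))

# Example: find_capture_near("gothamist.com", "20130101")
```

The closest capture may still be years away from the date you asked for, so check the timestamp in the returned URL.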

Google cache isn't reliable. Takedowns can occur by request w/in 24 hrs. Easy way to save multiple pgs: curl. Howto: https://t.co/Esa7VK0VTR https://t.co/Trb7oQgoWn
— Robin Stuart (@rcstuart) November 2, 2017
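If you'd rather not write a curl loop, the same save-multiple-pages idea can be sketched in Python. This is not the how-to linked in the tweet, just an illustration: the function names, the file-naming scheme, and the one-second pause between requests are all our assumptions.

```python
import re
import time
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_for(url):
    """Derive a filesystem-safe .html filename from a URL."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    if parsed.query:
        raw += "_" + parsed.query
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return (safe or "index") + ".html"


def fetch_bytes(url):
    """Download one page (this makes a network request)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def save_pages(urls, out_dir, fetch=fetch_bytes, delay=1.0):
    """Save each URL to out_dir, skipping any that fail to download."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError as err:  # urllib's URLError and HTTPError are OSErrors
            print(f"skipped {url}: {err}")
            continue
        target = out / filename_for(url)
        target.write_bytes(data)
        saved.append(target)
        time.sleep(delay)  # be polite to the server
    return saved

# Example: save_pages(["https://web.archive.org/web/2017/http://gothamist.com/"], "clips")
```

This saves the raw HTML only; images and stylesheets need a fuller tool such as the ones listed in this post.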

Natalie Grybauskas from de Blasio’s press office has a bunch of clips stashed in emails, and Cory Epstein is running a shared Google Spreadsheet where people can add URLs of lost stories so they can be extracted from the Internet Archive.
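Once a spreadsheet like that is exported as CSV, turning the listed URLs into Wayback Machine lookup links is a few lines of work. A rough sketch, assuming a column named `url` (we don't know the real sheet's layout) and relying on the Wayback Machine's habit of resolving a partial timestamp such as a bare year to a nearby capture.

```python
import csv
import io


def wayback_lookup_links(csv_text, column="url", year="2017"):
    """Return one Wayback lookup link per distinct URL in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    links = []
    for row in reader:
        url = (row.get(column) or "").strip()
        if not url or url in seen:
            continue  # skip blanks and duplicates
        seen.add(url)
        links.append(f"https://web.archive.org/web/{year}/{url}")
    return links

# Example:
# wayback_lookup_links("url\nhttp://gothamist.com/a\n")
```

A link built this way only helps if the page was actually captured; uncaptured URLs will land on a "not archived" page.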

We’ll update the story as we get more info on saving the lost articles, although nothing replaces the websites themselves.

Credits

Erin Kissane

Editor, Source, 2012-2018.
- OpenNews
- @kissane
