Protecting Your Sources When Releasing Sensitive Documents

Scrub metadata, redact information properly, search for microdots & more

DocumentCloud view of leaked NSA report.

Extraordinary documentation can make for an extraordinary story, and terrible trouble for sources and vulnerable populations if handled without enough care. Recently, the Intercept published a story about a leaked NSA report, posted to DocumentCloud, that alleged Russian hacker involvement in a campaign to phish American election officials. Simultaneously, the FBI arrested a government contractor, Reality Winner, for allegedly leaking documents to an online news outlet. The affidavit partially revealed how the FBI identified Winner, including a postmark and physical characteristics of the document that the Intercept posted.

The Intercept isn’t alone in leaving digital footprints in its article material. In a post called “We Are with John McAfee Right Now, Suckers,” Vice posted a picture of then-fugitive John McAfee, complete with GPS coordinates that pinpointed the location of their source, who was in official custody shortly afterward. In 2014, the New York Times improperly redacted an NSA document from the Snowden trove, revealing the name of an NSA agent.

The first step with any sensitive material is to consider what will happen when the subjects or the public see that material. It can be hard to pause in the rush of getting a story out, but giving some thought to the nature of the information you’re releasing, what needs to be released, what could be used in unexpected ways, and what could harm people, can prevent real problems.

A Checklist for Sensitive Documents

Removing potentially harmful information from documents is difficult. To make it a little easier, DocumentCloud is creating a checklist of what to think about when making a sensitive document public. But even when the material isn’t on DocumentCloud, this checklist can help reporters and news organizations protect their sources, or other vulnerable people, from getting hurt by the materials posted along with a story.

✔ Have you scrubbed the document metadata?

Many modern file formats contain metadata to support popular features. If you’ve used Track Changes, or geotagged a photo, both are forms of metadata that can persist invisibly in a document and may reveal details about vulnerable people or sources. Beyond those two examples, nearly all modern file formats carry some kind of metadata, from email headers to the ID3 details embedded in MP3s. It can seem daunting, but a search for the format of your files plus the word “metadata” can help you find tools to analyze, and if needed, remove metadata.

A few examples…

  • Microsoft Word documents: These documents may contain a few types of hidden information. Here’s a primer.
  • Images: EXIF is the metadata attached to digital photos. There are quite a few free online EXIF viewers, but if you can’t afford to upload sensitive material, you can also view EXIF data on your own machine via these browser plugins for Firefox and Chrome.
  • PDFs: Here’s an overview of PDF properties and metadata. In DocumentCloud’s case, its platform will convert images, Word and Excel documents, and HTML pages into PDFs. In these conversions, DocumentCloud removes the metadata from the original when creating the PDF. However, DocumentCloud currently does not remove metadata from documents uploaded directly as PDFs.
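As a rough illustration of how much hidden information a Word file can carry, here is a minimal Python sketch that lists the "core properties" stored inside a .docx file. It relies on the fact that a .docx file is a ZIP archive whose document properties usually live in docProps/core.xml. The function name docx_core_properties is made up for this example, and the sketch only reads one metadata location: tracked changes, comments, and custom properties are stored elsewhere in the archive and are not covered here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def docx_core_properties(docx_bytes: bytes) -> dict:
    """Return the non-empty core properties (author, last editor, dates, ...)
    found in a .docx file, keyed by their local XML tag name.

    Illustrative helper only: it inspects docProps/core.xml and nothing else.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))

    properties = {}
    for child in root:
        # Tags look like "{namespace}creator"; keep only the local name.
        name = child.tag.rsplit("}", 1)[-1]
        if child.text and child.text.strip():
            properties[name] = child.text.strip()
    return properties


# Example usage (the filename is hypothetical):
# with open("leaked-report.docx", "rb") as fh:
#     for key, value in docx_core_properties(fh.read()).items():
#         print(f"{key}: {value}")
```

An empty result does not mean the file is clean. It only means this one properties file held nothing, so a dedicated metadata tool is still worth running before publication.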

✔ Have you checked for identifiers?

Identifiers may include:

  • Printer dots
  • Watermarks
  • Text/font variations
  • Unusual spacing

Documents can be modified to allow the author to track a document’s life after creation. The oldest technique for doing this is a faint print on the paper—the traditional watermark. With digital documents, variations in text, spacing, spelling, or even phrases, can allow an author to create versions that link back to specific people or groups of people in order to investigate the origin of a potential leak. Additionally, printers can “sign” paper documents, adding physical metadata to documents through microdots printed directly on the documents that are barely visible to the human eye.

Defeating these techniques requires a careful inspection of the documents, looking for telltale signs and modifying the document to obscure its origin. Sometimes, recreating the document may be necessary, but that’s a judgement call that you have to make on a case-by-case basis. Inspection is never foolproof, but spotting and correcting the spacing, spelling, and physically identifying features of a document can go a long way toward mitigating danger to the people who would become vulnerable once a document is published.
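For digital text, part of that inspection can be assisted by a script. The following Python sketch flags characters that look like an ordinary space (or like nothing at all) but are not, along with runs of repeated spaces, since both could in principle be used to mark individual copies. The function name find_unusual_spacing and the choice of what counts as "unusual" are assumptions made for this example. It will not catch font changes, deliberate misspellings, or reworded phrases, and a clean result is not proof that a document is unmarked.

```python
import re
import unicodedata


def find_unusual_spacing(text: str) -> list:
    """Return sorted (offset, description) pairs for suspicious spacing.

    Flags: space-like characters other than U+0020 (Unicode category Zs),
    invisible format characters such as zero-width spaces (category Cf),
    and runs of two or more ordinary spaces.
    """
    findings = []
    for offset, ch in enumerate(text):
        category = unicodedata.category(ch)
        odd_space = category == "Zs" and ch != " "
        invisible = category == "Cf"
        if odd_space or invisible:
            # Fall back to the code point if the character has no name.
            label = unicodedata.name(ch, f"U+{ord(ch):04X}")
            findings.append((offset, label))

    for match in re.finditer(r" {2,}", text):
        findings.append((match.start(), f"run of {len(match.group())} spaces"))

    return sorted(findings)


# Example usage:
# for offset, label in find_unusual_spacing("Top\u00a0secret\u200b report  draft"):
#     print(offset, label)
```

Findings are only leads. Non-breaking spaces and double spaces after periods are also common in ordinary documents, so each hit still needs a human judgment about whether it is a fingerprint or just a typing habit.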

Yellow dots on white background

An example of printer microdots.

✔ Have you accounted for other information that could reveal vulnerable people combined with this document?

In considering the newsworthiness of a document, it’s also worth considering what will happen when the public or subjects of a document see that document. Sometimes details that aren’t personally identifying on their own can be patched together with other publicly available information, in articles or public webpages, and reveal identities or unintentional details.

It’s hard to know in advance if this is possible, but it’s worth taking some time to consider. Uniquely identifying information, such as geographical or life details, can often narrow down the identity of an anonymous person quickly. Harassers (or worse) can find vulnerable people.

✔ Is the document properly redacted?

Documents can contain sensitive content which you wish to redact. These could be addresses, phone numbers, personally identifying information, or information which could reveal a source. There are a number of redaction tools, DocumentCloud included, which will expunge text and visible content in a document. But it is important to understand how your redaction tools work, and to verify the results. It’s not enough to draw black boxes over digital text—the text itself must be expunged from the document.

For example, DocumentCloud will remove a digital page from a PDF, and replace that page with an image snapshot of that page. DocumentCloud will then OCR the image, and use the resulting text in the document. This ensures that the text you wish to remove cannot be inadvertently included in your document. In DocumentCloud, you can check the results by clicking on the Text tab in the viewer, as well as checking the Original Document link.

Whatever tool you use, read the instructions and double-check redactions before they are made public.
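One way to double-check is to extract the text from the redacted file (for instance by copying it out of a viewer's text tab) and search it for the strings you meant to remove. The Python sketch below does that comparison, ignoring case and differences in whitespace. The function name leaked_terms is hypothetical, and the approach has clear limits: it only sees the text you hand it, so OCR errors, text hidden in metadata, or content readable only in the image will not be caught.

```python
import re


def _normalize(value: str) -> str:
    """Collapse whitespace and ignore case so line breaks don't hide a match."""
    return re.sub(r"\s+", " ", value).casefold().strip()


def leaked_terms(extracted_text: str, sensitive_terms: list) -> list:
    """Return the sensitive terms that still appear in the extracted text.

    Illustrative check only: an empty list means none of the given strings
    were found in this text, not that the redaction is guaranteed safe.
    """
    haystack = _normalize(extracted_text)
    found = []
    for term in sensitive_terms:
        needle = _normalize(term)
        if needle and needle in haystack:
            found.append(term)
    return found


# Example usage with made-up values:
# text = "Contact: Jane  Doe\nPhone: [REDACTED]"
# print(leaked_terms(text, ["jane doe", "555-0100"]))
```

Running the same check against the original, unredacted text first is a useful sanity test: if the terms are not found there either, the search strings are wrong rather than the redaction being right.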

✔ Is the document the minimum needed for the story?

Publishing only what the story needs, in content and context, minimizes the possibility of harm and focuses reader attention on what matters the most.

It’s our hope that by following this checklist, and thinking carefully about how the document will be perceived and used in public, journalists can maximize the effectiveness of the evidence that supports their stories while minimizing the harm to sources and bystanders.

Do you have suggestions for something we’ve missed? Send them along.


  • Ted Han

    Ted Han was the lead technologist behind DocumentCloud from 2011 to 2018, a successful project and hosting service used widely for publication of newsworthy documents and for document analysis. Ted has been involved in open source software for 15 years, was lead developer at Investigative Reporters and Editors and taught at Missouri University School of Journalism.

  • Quinn Norton

    Quinn Norton is a technology journalist who likes to hang out in the dead-end alleys and rough neighborhoods of the internet, where bad things can happen to defenseless little packets.

