Don’t Delete Evil Data

The case for archiving online misconduct and abuse

The web needs to be a friendlier place. It needs to be more truthful, less fake. It definitely needs to be less hateful. Most people agree with these notions.

There have been a number of recent efforts to act on this idea: Facebook has deleted the groups and pages operated by Russian actors during the 2016 election. None of the Twitter accounts listed in connection with the investigation into Russian interference in the last presidential election are online anymore. And late last fall, Reddit announced that it was banning Nazi, white supremacist, and other hate groups.

But even though much harm has been done on these platforms, is the right course of action to erase all these interactions without a trace? So much of what constitutes our information universe is captured online—if foreign actors are manipulating political information we receive and if trolls turn our online existence into hell, there is a case to be made for us to be able to trace back malicious information to its source, rather than simply removing it from public view.

In other words, there is a case to be made to preserve some of this information, to archive it, structure it, and make it accessible to the public. It’s unreasonable to expect social media companies to sidestep consumer privacy protections and to release data attached to online misconduct willy-nilly. But to stop abuse, we need to understand it. We should consider archiving malicious content and related data in responsible ways that allow researchers, sociologists, and journalists to better understand its mechanisms and, potentially, to demand more accountability from trolls whose actions would otherwise be deleted without a trace.

The Problem with Relying on Tech Companies to Archive Data

In an ideal world, social media companies would provide this kind of data in anonymized formats for academic or journalistic studies, perhaps on request or for information in the public interest. Government institutions like the Census Bureau or the Bureau of Labor Statistics, for instance, publish microdata: sanitized, representative, and detailed samples of giant surveys that researchers can query for specific findings.

The reality looks a little different: access to data from social media platforms is often scarce at best.

For one, the kind of data that official channels like API data streams provide is very limited. Despite harboring warehouses of data on consumers’ behavior, social media companies only provide a sliver of it through their APIs: Facebook, for instance, only gives developers data for public pages and groups, and Twitter often restricts access to a set number of a user’s most recent tweets or to a limited time frame for searches.

Then there are limitations on the kind of data users can request of their own online persona and behavior. Some services like Facebook or Twitter will allow users to download a history of the data that constitutes their online selves—their posts, their messaging, or their profile photos—but that data archive won’t always include everything each social media company has on them either.

For instance, users can only see which ads they’ve clicked on going back three months, making it hard for them to check whether they ever clicked on a Russia-sponsored post. Instagram doesn’t allow archival downloads at all.

Last but not least, extracting social media data from the platforms through scraping is often against the terms of service. Scraping a social media platform can get users booted from a service and potentially even result in a lawsuit.
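Researchers who do decide to scrape have at least one machine-readable signal to consult before fetching anything: a site’s robots.txt file. It carries no legal weight and is separate from the terms of service, but checking it is a bare-minimum courtesy. A minimal Python sketch using only the standard library (the user-agent name here is made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A site's robots.txt is a machine-readable (though not legally binding)
# signal of what the operator allows crawlers to fetch. The terms of
# service still apply separately.
def is_fetch_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example: a robots.txt that disallows everything for unnamed crawlers.
sample = "User-agent: *\nDisallow: /"
print(is_fetch_allowed(sample, "my-research-bot", "https://example.com/feed"))  # False
```

A scraper that ignores this file can still be blocked or banned, of course; the check only tells you what the site operator has asked for.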

For social media platforms, suing scrapers may make financial sense. A lot of the information that social media platforms gather about their users is for sale—not directly, but companies and advertisers can profit from it through ads and marketing. Competitors could scrape information from Facebook to build a comparable platform, for instance. But lawsuits may inadvertently deter not just economically motivated data scrapers but also academics and journalists who want to gather information from social media platforms for research purposes.

Faced with these data-gathering restrictions, journalists and researchers have had to find creative ways to get some of that data.

Vigilante Data Archiving

One of the most straightforward ways to get the data is to set up automated scraping mechanisms and archive it over time. While this approach can be cumbersome and take quite a lot of planning, it can yield great results for longer-term stories like this look at Congress’s first year under President Donald Trump.
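The scrape-and-archive-over-time pattern itself is simple: fetch a page on a schedule and write each capture to a timestamped file so that changes can be diffed later. A minimal Python sketch of the storage side of that idea (the directory layout and function names are this sketch’s own, not any particular project’s):

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

# Minimal sketch of a "scrape now, analyze later" archiver: each capture
# is written to a timestamped file so changes can be diffed over time.
# The fetch step is passed in as a parameter so the storage logic stays
# testable without a network connection.
def snapshot_path(archive_dir: str, url: str, now: datetime) -> Path:
    # Hash the URL so the filename is filesystem-safe and stable.
    url_key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return Path(archive_dir) / url_key / f"{stamp}.html"

def archive_page(fetch, url: str, archive_dir: str) -> Path:
    path = snapshot_path(archive_dir, url, datetime.now(timezone.utc))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(fetch(url), encoding="utf-8")
    return path
```

Run from a scheduler such as cron, a loop like this accumulates a private Wayback Machine for whatever handful of pages a story needs.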

There are also experiments that crowdsource some of this data, like Gizmodo’s People You May Know Inspector or ProPublica’s political-ad-grabbing Chrome extension. While the sample size and selection of this data are fairly limited to the audiences and peripheral reach of these experiments, they at least start to get at a way to gather information that social media companies just won’t hand over.

On top of these project-dependent approaches, a small but growing group of people have taken social data archiving into their own hands. They are like librarians with coding skills, who see not just websites but also the millions of data points produced on social media as a trove of information that should be preserved and kept accessible online.

Most notably, there’s the Internet Archive, a non-profit organization that has been archiving the web—and now increasingly the social web—since 1996.

The Internet Archive has always concerned itself with preserving the web’s ephemera: the Wayback Machine tracks changes on all kinds of websites, from those of government agencies to those of corporations. A new project is capturing “lower thirds” on news networks (the captions overlaid on the bottom portion of the screen during a TV broadcast) in a searchable database. And, helpfully, the Archive has more recently also started archiving millions of tweets that it gathers monthly from Twitter’s API.
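For individual URLs, the Internet Archive exposes a small public JSON endpoint for checking whether a snapshot exists, the Wayback Availability API at archive.org/wayback/available. A short Python sketch, with the response parsing separated out so it can be exercised without a live network call:

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# The Wayback Availability API returns {"archived_snapshots": {"closest":
# {...}}} when a capture exists, and an empty "archived_snapshots" object
# otherwise.
AVAILABILITY_API = "https://archive.org/wayback/available"

def closest_snapshot_url(response: dict) -> Optional[str]:
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

def check_wayback(url: str) -> Optional[str]:
    query = urlencode({"url": url})
    with urlopen(f"{AVAILABILITY_API}?{query}") as resp:  # live network call
        return closest_snapshot_url(json.load(resp))
```

The returned snapshot URL points into the Wayback Machine itself, so a reporter can confirm that a now-deleted page was captured before citing it.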

There are individuals, too, who have taken it upon themselves to make parts of the social web accessible and searchable beyond time limits.

One such project is the Trump Twitter Archive: developer Brendan Brown took it upon himself to archive all of Donald Trump’s tweets, along with tweets from more than 50 other accounts belonging to political figures.

Software engineer Jason Baumgartner created a Reddit archive that currently contains more than 4 billion comments and submissions. He built an API to make the data accessible to others, an interface that lets people without coding skills query the data, and bulk downloads of the full dataset. He told Source that he has spent $15,000 on the bandwidth and computer equipment needed to power the API.
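Baumgartner’s API can be queried over plain HTTPS. The sketch below assumes the comment-search endpoint and parameter names (q, subreddit, size) as documented at api.pushshift.io around the time of writing, so treat the specifics as illustrative rather than definitive:

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

# Sketch of querying the Reddit archive's public search API. The endpoint
# and parameter names are taken from its documentation as of this writing;
# parsing is split out so it can be tested without hitting the network.
SEARCH_ENDPOINT = "https://api.pushshift.io/reddit/search/comment/"

def build_search_url(query: str, subreddit: str, size: int = 25) -> str:
    params = urlencode({"q": query, "subreddit": subreddit, "size": size})
    return f"{SEARCH_ENDPOINT}?{params}"

def comment_bodies(response: dict) -> List[str]:
    # Results come back under a top-level "data" key, one dict per comment.
    return [item.get("body", "") for item in response.get("data", [])]

def search_comments(query: str, subreddit: str) -> List[str]:
    with urlopen(build_search_url(query, subreddit)) as resp:  # live call
        return comment_bodies(json.load(resp))
```

This is the kind of access the official Reddit API doesn’t offer: full-text search across the archive’s entire history rather than a rate-limited window of recent activity.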

More than two dozen papers have been published using his data, he wrote in an email.

“I have always been a strong supporter of data transparency and furthering academic research involving social media topics,” Baumgartner wrote. “The ability to aggregate and disseminate such a massive amount of data presented a unique opportunity to not only learn for myself about social media interactions, but also to give something back to the academic community through the tools previously mentioned.”

Brown and Baumgartner are part of a growing movement of developers and engineers—some work in tech-related jobs, others are former Google and Facebook employees—who are pushing for more data transparency across the web (I like to think of them as a group of data vigilantes).

What’s questionable though is whether these shoestring operations are enough for journalists and researchers to fully comprehend today’s online ecosystem.

The Role of Government

It’s only natural that people will demand more data transparency from social media companies as more and more parts of their lives and social interactions move online. And with constituents demanding more transparency, it’s only logical that governments will play some kind of role in regulating today’s overwhelming and rapidly changing social media data troves.

Governments around the world have already begun to draft and enforce legislation governing how tech companies handle their content and their consumers. There’s the European Union’s right to be forgotten, which requires Google to delist certain links about users from search results on request so they can be “forgotten.” This year, Germany started enforcing a law that requires Facebook to swiftly remove hate speech, fake news, and misinformation. And in May, the General Data Protection Regulation goes into effect in the European Union, a set of rules that, among other things, lets users know what kind of data social media companies gather about them.

What role could government play in demanding radical transparency from social media companies?

We may be getting the first clue soon. As part of its investigation into Russian interference in the 2016 US presidential election, the House Permanent Select Committee on Intelligence has asked social media companies to release related data. Twitter has since notified users who interacted with Russian-controlled accounts, and the Committee was able to release at least some of the ads bought by Russian political actors, but a complete, searchable corpus of data remains elusive.

What legislators are demanding now is for the entire election-related data corpus to be released. If it is, perhaps we can eventually push for more.

A spokesperson from Facebook said the company does share some anonymized data with humanitarian organizations; in the past, it has supplied anonymized data to help them properly allocate resources during disasters. But this kind of access is rarely extended to academics and researchers, let alone journalists. Twitter did not reply to inquiries for comment before publication of this piece.


