
Automating Transparency

How I Made CongressEdits


The House of many edits (Jeffrey Zeldman via Flickr)

Sometimes you write a piece of software and it gets used for purposes you didn’t quite imagine at the time. Sometimes you write a piece of software and it unexpectedly rearranges your life. I’d like to tell you a quick story about a Twitter bot named @CongressEdits, which tweets when someone edits Wikipedia anonymously from the United States Congress. In this post I’ll describe how the bot came to be, what it has been used for so far, and how it works. @CongressEdits taught me how the world of archives intersects with the worlds of politics and journalism. To explain how that happened, I first need to set the scene.

The funny thing about @CongressEdits is that it wasn’t my idea at all. Back in July of 2014 I happened to see a tweet go by in my stream from Tom Scott, announcing @parliamentedits, a bot he had set up to tweet anonymous Wikipedia edits made from inside the UK Parliament.

Tom Scott’s insight was that Wikipedia publishes a contributions page for every IP address that has edited Wikipedia, and that this page could easily be plugged into Twitter. For example, you can see what edits 194.60.38.198 has made here. The page is also available as an Atom feed, so it can be used by a feed reader or by other software like If This Then That (IFTTT). IFTTT lets you easily funnel data from one service (Facebook, Flickr, Instagram, Twitter, Gmail, etc.) to another. Tom created an IFTTT recipe that watched the feeds for two IP addresses he knew were proxy servers for the UK Parliament, and tweeted from the @parliamentedits account whenever new edits were found. How did he know the IP addresses? From a FOIA request, naturally.
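If you want to watch an address the same way, MediaWiki will serve any IP’s contributions as a feed. I don’t know the exact URL Tom plugged into IFTTT, but one standard way to get such a feed is the MediaWiki API’s feedcontributions action:

https://en.wikipedia.org/w/api.php?action=feedcontributions&user=194.60.38.198&feedformat=atom

Point a feed reader, or an IFTTT recipe, at a URL like that and you have the raw material for a bot of your own.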

Wikipedia

According to Alexa, wikipedia.org is the sixth most popular destination on the Web. Wikipedia is, of course, the encyclopedia anyone can edit, so long as you can stomach wikitext and revert wars. Wikipedia is also a platform for citizen journalism, where events are documented as they happen. For example, the article about Germanwings Flight 9525 that crashed on March 24, 2015 was edited 2,006 times by 313 authors in 3 days.

What is perhaps less commonly known is that Wikipedia is a haven for a vast ecosystem of bots. These bots perform all sorts of maintenance tasks: anti-vandalism, spell checking, category assignment, as well as helping editors with repetitive operations. Some estimate that as much as half of the edits to Wikipedia are made by bots. There’s a policy for getting your bot approved to make edits, and there are software libraries that make writing one easier. Wikipedia bots are themselves the subject of study by researchers like Stuart Geiger, since in many ways this is terra incognita for information systems. What does it mean for humans and automated agents to interact in this way? What does it mean to think of Wikipedia bots in the context of computational journalism? Does it even make sense?

While these questions are certainly of interest, to understand the story of @CongressEdits you really only need to know two things about Wikipedia:

  1. Wikipedia keeps a version history of all the edits to a particular article.
  2. Wikipedia allows you to edit without logging in.

Typically editors log in, and any edits they make are associated with their user account. But to lower the barrier to contributing, you can also edit articles without logging in: so-called anonymous or (more precisely) unregistered editing. When you edit this way there is no user account to tie the edit to, so Wikipedia ties the edit to the IP address of the computer that performed it.

If you go to Google and ask “what is my IP address” you should see a box at the top with your IP address in it. This is the IP address that Google thinks you are connecting from. Given the way networks are set up at workplaces, hotels, and so on, it’s possible that this IP address identifies a proxy server that filters content for many people on your network. So the IP address seen by Wikipedia may identify your organization, not your specific workstation.

Spammers and other vandals will often edit without logging in. Wikipedia uses these IP addresses to identify pages that have been vandalized, and will sometimes temporarily block edits from an offending address. It’s ironic that unregistered edits are often referred to as “anonymous,” since an IP address says a great deal about where the user is editing Wikipedia from. IP addresses add a physical dimension to an internet that we tend to think of as a disembodied space.

CongressEdits

So back in July of 2014, I saw Tom’s tweet and thought it could be interesting to try to do the same thing for the US Congress. But I didn’t know what the IP addresses were. After a quick search I found a Wikipedia article about edits to Wikipedia from the US Congress. A group of Wikipedians had already been tracking edits from Congress, but in a more manual way. I tweeted the IP addresses from the article to some experienced civic hackers I followed on Twitter, to see if they could verify them.

Joshua Tauberer responded with a pointer to the GovTrack source code on GitHub, where he had a similar set of ranges. GovTrack is a government transparency site that aggregates information from government websites to provide easy access to the US legislative record. The good news was that Josh’s list matched the ranges in Wikipedia, and added a few more. The bad news was that the ranges included hundreds of thousands of individual IP addresses. I didn’t know which machines in those ranges were proxy servers, or even if there were proxy servers at all—it just wasn’t feasible to watch hundreds of thousands of Atom feeds.

Fortunately, I had previously worked on a very simple application, Wikistream, that visualizes the current edits to Wikipedia in all major languages. To build it I needed to tap into the edit stream for all the language-specific Wikipedias, which sounds difficult but is in fact quite easy. I had learned a few years earlier that the MediaWiki instance behind each language-specific Wikipedia logs into an Internet Relay Chat (IRC) chatroom and announces all edits there. These channels are used by some of the previously mentioned anti-spam and anti-vandalism bots to keep abreast of what is changing on Wikipedia. Wikistream simply logs into those IRC channels and displays the edits as a stream on a web page. While creating Wikistream I also created a little Node library called wikichanges that bundles up the channel-watching and parsing code for reuse.

Here’s an example of a short Node program that uses the wikichanges library to print out the title of each change to all Wikipedias as they happen:

// load the wikichanges library (npm install wikichanges)
var wikichanges = require('wikichanges');

// connect to the IRC channels where each Wikipedia announces its edits
var changes = new wikichanges.WikiChanges();

// print the title of each page as it is edited
changes.listen(function(change) {
  console.log(change.page);
});

The Wikimedia Foundation now also hosts its own stream service, which provides WebSocket, XHR, and JSONP polling interfaces to the stream of edits as they happen. This means you can write some static HTML and JavaScript that connects to the stream without having to bother with the IRC chatrooms or run a server of any kind. Here’s an example of a static HTML page that will display a list of edits to the English Wikipedia:

<!doctype html>
<html>

  <head>

    <!-- the socket.io client (the version the stream service speaks) and jQuery -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/socket.io/0.9.1/socket.io.js"></script>
    <script src="https://code.jquery.com/jquery-1.11.2.min.js"></script>

    <script>

      // connect to Wikimedia's recent changes stream
      var socket = io.connect("http://stream.wikimedia.org/rc");

      // once connected, subscribe to edits from the English Wikipedia
      socket.on("connect", function() {
        socket.emit("subscribe", "en.wikipedia.org");
      });

      // prepend each edit's title to the list as it arrives
      socket.on("change", function(change) {
        $("ul").prepend("<li>" + change.title + "</li>");
      });

    </script>
  </head>

  <body>
    <h1>English Wikipedia Edits !!!</h1>
    <ul></ul>
  </body>

</html>

You can copy and paste this into a text file and open it in your browser. Each change object passed to the change callback carries quite a bit of additional information about the edit. For example, here’s the JSON for an edit to the 2015 military intervention in Yemen article in the English Wikipedia:

{
  "bot": false,
  "comment": "",
  "id": 727311673,
  "length": {
    "new": 64728,
    "old": 64728
  },
  "minor": false,
  "namespace": 0,
  "revision": {
    "new": 655651590,
    "old": 655651542
  },
  "server_name": "en.wikipedia.org",
  "server_script_path": "/w",
  "server_url": "http://en.wikipedia.org",
  "timestamp": 1428568783,
  "title": "2015 military intervention in Yemen",
  "type": "edit",
  "user": "80.184.65.164",
  "wiki": "enwiki"
}

From this information it’s possible to construct a URL for the diff, or to talk back to the MediaWiki API or Wikimedia’s shiny new REST API for more information about the article that changed.
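As a minimal sketch, a diff URL can be assembled from the server_url, server_script_path, and revision fields above, using MediaWiki’s standard index.php?diff=…&oldid=… pattern:

// build a link to the diff for a change object like the one above
function diffUrl(change) {
  return change.server_url + change.server_script_path +
    "/index.php?diff=" + change.revision.new +
    "&oldid=" + change.revision.old;
}

// for the example edit this yields:
// http://en.wikipedia.org/w/index.php?diff=655651590&oldid=655651542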

When I realized that there were hundreds of thousands of IP addresses to monitor for the US Congress, it occurred to me that it would be pretty easy to watch the changes as they come in, and see if an IP matched a range, rather than needing to poll hundreds of thousands of Atom feeds. After a couple hours’ work, I had a short program that tweeted edits that came from the US Congress. I put the code on GitHub and thought a handful of my friends would follow it.
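In outline, that is all the bot does. Here is a stripped-down sketch of the idea in Node (not anon’s actual code, which also handles IPv6, tweet templating, and the tweeting itself). The ranges shown are an illustrative subset, and I’m assuming the user and anonymous fields that wikichanges attaches to each change:

var wikichanges = require('wikichanges');

// an illustrative subset of the IP ranges; the real Congress list is longer
var ranges = ['143.228.0.0/16', '143.231.0.0/16'];

// convert a dotted-quad IPv4 address to an unsigned 32-bit integer
function ipToInt(ip) {
  return ip.split('.').reduce(function(n, octet) {
    return (n << 8) + parseInt(octet, 10);
  }, 0) >>> 0;
}

// test whether an IPv4 address falls inside a CIDR range
function inRange(ip, cidr) {
  var parts = cidr.split('/');
  var mask = ~(Math.pow(2, 32 - parseInt(parts[1], 10)) - 1) >>> 0;
  return (ipToInt(ip) & mask) === (ipToInt(parts[0]) & mask);
}

var changes = new wikichanges.WikiChanges();
changes.listen(function(change) {
  // unregistered edits carry the editor's IP address in change.user
  if (!change.anonymous) return;
  if (ranges.some(function(r) { return inRange(change.user, r); })) {
    // this is where the real bot would send a tweet
    console.log(change.page + ' edited from ' + change.user);
  }
});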

Little did I know…

anon

We’ve all heard about the promise of open-source software. I’m a believer, even though it has been rare for something I’ve put on GitHub to get more than an occasional pull request or bug fix. The initial code for CongressEdits was 37 lines of CoffeeScript. Once it was up on GitHub I quickly got requests to make it configurable for other Twitter accounts, to customize the text of the tweet, to provide IPv6 support, and (of course) to allow it to listen to other IP address ranges. Since it was such a small program it was easy to accommodate these requests. I renamed the project to anon, since it was now about more than CongressEdits, and then things got interesting: a merry band of sixty or so Twitter bots, administered by almost as many people, sprouted up.
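To give a flavor of that configurability, here is a hypothetical configuration for a bot like these. The field names are illustrative rather than anon’s exact schema, but the knobs are the ones people asked for: Twitter credentials, a tweet template, and a set of named IP ranges to watch.

{
  "accounts": [
    {
      "consumer_key": "...",
      "consumer_secret": "...",
      "access_token": "...",
      "access_token_secret": "...",
      "template": "{{page}} Wikipedia article edited anonymously from {{name}} {{url}}",
      "ranges": {
        "US House of Representatives": ["143.228.0.0/16", "143.231.0.0/16"],
        "US Senate": ["156.33.0.0/16"]
      }
    }
  ]
}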

Jari Bakken, a civic hacker in Norway, quickly put together a historical view of the edits for these bots using Google BigQuery and Wikipedia dumps. The Gitter chatroom for anon proved to be a great way to communicate with other people who were interested in running the bot or contributing to the project.

A few days after I put anon on GitHub, Tom Scott wrote to me saying that @parliamentedits hadn’t tweeted any changes yet, and he suspected that the two proxy servers in his IFTTT recipe were no longer being used by Parliament. He and Jonty Wareing were able to determine that, as with the US Congress, there was a large range of addresses that needed to be monitored. Jonty started up his own anon bot to monitor these Parliament IP ranges, and the original IFTTT recipe was retired. Tom Scott sent a second FOIA request to obtain the IP ranges, but this time it was denied.

I was shocked at how rapidly these bots popped up. I was equally surprised by how many people followed @CongressEdits: in 48 hours it jumped from 0 to 3,000 followers, and rapidly grew another order of magnitude to 30,000 followers.

In the News

Watching the followers rise, and the flood of tweets from them, brought home something that I believed intellectually, but hadn’t felt quite so viscerally before. There is an incredible yearning—in the United States and around the world—to use technology to provide more transparency about our governments. But at the same time, there are also efforts to obscure this access, and to make a mockery of our politics. I recently compiled the top five retweeted @CongressEdits tweets, which I think reflect this range:

Retweets  Tweet text
1347      Reptilians Wikipedia article edited anonymously from US House of Representatives http://t.co/B7VLkhLsb8
759       Senate Intelligence Committee report on CIA torture Wikipedia article edited anonymously from US Senate http://t.co/Bj4q8Naed1
740       Horse head mask Wikipedia article edited anonymously by Congress http://t.co/Ddh98AtAzx
658       Choco Taco Wikipedia article edited anonymously by US House of Representatives http://t.co/QzECJYjf6v
626       Step Up 3D Wikipedia article edited anonymously by US Senate http://t.co/8Cd1HfhUbP

The comparison with the top five @parliamentedits retweets yields similar results:

Retweets  Tweet text
72        Barnett formula Wikipedia article edited anonymously from Houses of Parliament http://t.co/R6HyuF2ZhL
36        Revenge porn Wikipedia article edited anonymously from Houses of Parliament http://t.co/COoJFmwid8
35        List of steam locomotives in Slovenia Wikipedia article edited anonymously from Houses of Parliament http://t.co/V6pjvjtWJp
24        Mosaic (Star Trek) Wikipedia article edited anonymously from Houses of Parliament http://t.co/EqiaT1kf0h
21        Alexis Tsipras Wikipedia article edited anonymously from Houses of Parliament http://t.co/3yqusgcICN

It’s interesting to follow the links to the diff for the change and see how quickly many of these edits were reverted or modified by other Wikipedia editors. In truth I think this is the real value of these anon bots: they provide a focused channel for Wikipedians interested in monitoring a particular class of Wikipedia content. But what’s equally interesting, and perhaps most relevant for Source’s readers, is how these bots have been used as a source and subject for investigative journalism.

Take, for example, the story surrounding Malaysia Airlines Flight MH17, which was shot down near the Ukraine-Russia border. Soon after the crash, the RuGovEdits bot spotted an edit to the Russian Wikipedia article for commercial aviation accidents from an IP address within the All-Russia State Television and Radio Broadcasting Company (VGTRK). The edit replaced text that asserted MH17 had been shot down “by terrorists of the self-proclaimed Donetsk People’s Republic with Buk system missiles, which the terrorists received from the Russian Federation” with text that said it had been shot down by “Ukrainian soldiers.” This soon became news at Global Voices, The Telegraph, Wired, and the Washington Post.

Another story closer to home concerns an edit made from the US Senate to the English Wikipedia article about the Senate Intelligence Committee report on CIA torture.

The article was modified very slightly to remove text stating that enhanced interrogation techniques were a “euphemism for torture.” News of this edit was picked up in the Huffington Post, Mashable, and BoingBoing.

But probably the biggest story to break in the US recently on the topic of controversial edits to Wikipedia was Kelly Weill’s “Edits to Wikipedia pages on Bell, Garner, Diallo traced to 1 Police Plaza”:

Computer users identified by Capital as working on the NYPD headquarters’ network have edited and attempted to delete Wikipedia entries for several well-known victims of police altercations, including entries for Eric Garner, Sean Bell, and Amadou Diallo.

Instead of using anon, Weill wrote a program to crawl through historical Wikipedia data to identify edits made from the NYPD. Within hours of the news story, civic hacker John Emerson started up an anon bot to make new edits from the NYPD available via the NYPDEdits Twitter account. In an interview with Andrew Lih, a Wikipedia expert and journalism professor at American University, Weill described how she went about this work.

News of the story spread far and wide in The Washington Post, The Verge, NY Daily News, Rolling Stone, The New York Post, The Daily Mail, Essence, Time, and more.

The Apparatus

An important thing to note in these stories is that the bots let us know the edits came from a particular place (VGTRK, US Congress, NYPD), but without further traditional investigative journalism we don’t really know who made the edits, or what their motivations were. Once @CongressEdits acquired 30,000 followers and individuals inside Congress became aware of its existence, some of the edits seem to have been made knowing that they would be broadcast: the observer effect kicked in. While it’s difficult (perhaps impossible) to spoof an IP address associated with a Wikipedia edit, someone could well go to the effort if the political stakes were high enough. Unsurprisingly, this technology is not a transparency panacea. The same political landscape is replicated and implicated in these bots, and they can be manipulated by actors for a variety of reasons once it’s clear how they operate.

I hope this article has helped to lay bare the apparatus behind CongressEdits and other anon-style bots. Now that you know how simple it is to tap into the stream of edits on Wikipedia, perhaps you have ideas for similar bots that could perform services of social, or even artistic, value. Soon after I created CongressEdits, my friend Dan Whaley suggested it would be interesting to observe all edits to articles related to the US Congress, and so @congresseditors came into existence. The volume knob on @congresseditors is set quite high, since there are so many edits (especially after an election), but it can be an interesting stream to dip into. Or consider Hatnote’s Listen to Wikipedia project, which taps into the Wikipedia edit stream to do just that: listen to Wikipedia. The mundane details of the edit stream can be reimagined, repurposed, and transformed. If you do put together a bot, I encourage you to put the code up on GitHub for others to see. You never know what might happen.

Life after Bot

You may still be wondering how CongressEdits changed my life. In the interests of transparency I should tell you that when I created the bot I was a software developer working on archival systems at the Library of Congress. As its name suggests, the Library of Congress is part of the US Congress; it sits right next door to the US Capitol Building. I wrote the bot on my own time as a small experiment, not really expecting much of anything to happen. But when it (and I) became the subject of media attention, and the peril and promise of this simple script became apparent, I began to look at my employment in a new light. Where do the documentary interests of archives and journalism intersect? How do automated agents and humans interact in these new information spaces we are building?

I was fortunate to be offered a job at the Maryland Institute for Technology in the Humanities, where working on questions like this is encouraged and supported, and to be given a place in the UMD iSchool PhD program to study them. So as you write your bots to change the world, I also encourage you to consider how they can (and will) change you, too.

Code

The code that runs CongressEdits and its sibling bots is anon, which is available on GitHub.

Credits

  • Ed Summers

    I’m a software developer at @umd_mith. I study web archives at @iSchoolUMD and work on @documentnow. Pro-social media at: https://t.co/uNbSDrHZf6
