The Strange Tale of FCCliefs

From Joke Remix to Research Grind to Quirky, Visibility-Boosting Chatbot


FCCliefs is a twitterbot birthed at the height of the controversy surrounding net neutrality. Like all artificial intelligence, even its earliest stages were fraught with questions about humanity’s future:

If this world’s population hopes to survive, THEN ✨✨✨

Fig. 1. A tweet from FCCliefs alpha v0.0.3


In its earliest imaginings, FCCliefs just retweeted everything the FCC tweeted, replacing random words with profanities and all the links with redirects to non-FCC-compliant websites. It was lazy, but so was I (cursingapi.com was on Hacker News that day).

When my pitch was not met with cheers around the newsroom, I decided I could get NLTK to teach FCCliefs how to generate original ideas. I needed an easy, standardized way to interact with the comments submitted to the FCC Electronic Comment Filing System (which I will be calling the FCCECFS). I needed a database.

The original plan was to use the FCCECFS search engine to find all comments and use their service to export the results to Excel.

[Screencap of the FCCECFS interface, which is not so beautiful]

Fig. 2. So beautiful.

I quickly learned that queries that returned more than 10,000 results could not be exported. To abide by this limitation, I had to filter my queries by the date the comments were posted. Originally, that wasn’t so bad. But then, John Oliver went on Last Week Tonight and told the audience to “for once in your lives, focus your indiscriminate rage in a useful direction. Seize your moment, my lovely trolls!” and submit their comments regarding net neutrality to the FCC. At long last, my dream of working with Big Data was about to come true.

My filter became a sieve. For two days after John Oliver’s call to action, I had to filter my queries by (united) state. I spent hours downloading 50+ separate Excel spreadsheets from the site, waiting several minutes for each to process, since trying to download more than one at a time would cause them all to fail. At least the territories were pretty much instant. I guess net neutrality isn’t as divisive in Palau?

After many days’ worth of hours, I had finally combed through the vast wasteland of ✨✨✨, and was rewarded with dozens of spreadsheets that looked like this:

[Screencap of an exported spreadsheet]

Fig. 3. A grim day in the spreadsheet mines.

Yes, that’s right, the spreadsheet didn’t even include the actual comment, only a link to the PDF. After an abysmal foray into writing a VBA script to concatenate Excel files, a long all-caps rant to Toph, father of the original idea for FCCliefs, and hours of copy-pasting Excel sheets together, I was left with a single spreadsheet containing more than 100,000 rows. After all that, it still wasn’t done. But thanks to Mr. Data Converter, turning the spreadsheet into JSON wasn’t a problem. Life is easy when you have programming.
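
For reference, each converted row came out as a JSON object shaped roughly like this (the values here are invented, but the keys are the ones the script below expects):

{
  "name": "Jane Q. Public",
  "received": "05/15/2014",
  "posted": "05/16/2014",
  "lawFirm": "",
  "link": "http://apps.fcc.gov/ecfs/document/view?id=1234567890"
}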

Next, I started programming. I wrote a Python script to:

  1. Iterate over the JSON

  2. Parse out the URLs

  3. Read the text off each PDF using PDFMiner

  4. Dump the text into the db.

import datetime
import re
import urllib2
from StringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pymongo import MongoClient
from IPython.core.debugger import Tracer

db = MongoClient().fcc  # local MongoDB instance; database name assumed

def add_to_db(dictionary):
    counter = 124  # starting value from the original run
    for item in dictionary:
        try:
            url = parse_url(item['link'])
            print counter
            print url
            counter += 1
            # Skip comments that are already in the database
            if not db.comments.find_one({"url": url}):
                text = pdf_from_url_to_txt(url)
                comment = {"type": "comment",
                           "url": url,
                           "name": item['name'],
                           "received": item['received'],
                           "posted": item['posted'],
                           "lawFirm": item['lawFirm'],
                           "date": datetime.datetime.utcnow(),
                           "text": text}
                db.comments.insert(comment)
        except Exception:
            Tracer()()  # drop into the IPython debugger to inspect the failure
            print "ERROR ----------------------------------"
            counter += 1

def parse_url(string):
    # Pull the first URL out of the link field
    return re.findall('(?:http://|www\.)[^"\' ]+', string)[0]

def parse_comment(comment):
    # Remove *.txt artifacts
    first_sub = re.sub('.*txt', '', comment)
    # Remove page numbers like "Page 12"
    second_sub = re.sub('[A-Za-z]+[\ ][0-9]+', '', first_sub)
    # Remove blank lines
    final_sub = re.sub('\n+', '', second_sub)
    return final_sub

def pdf_from_url_to_txt(url):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
    # Fetch the PDF and wrap the raw bytes in a file-like object
    f = urllib2.urlopen(urllib2.Request(url)).read()
    fp = StringIO(f)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, set(), maxpages=0, password="",
                                  caching=True, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return parse_comment(text)

This script ran for hours and stored 102,482 comments. Well, more than that, if you count when I screwed up the MongoDB configuration and lost everything…twice.

Today, there are over two million comments on the FCC site. In an alternate universe, I have written a script to poll fcc.gov for changes, and the script runs on a dedicated server at Bloomberg HQ. In this timeline, the project fell prey to corporate apathy, and only the occasional vulture circles by to pick at the bones.

I used NLTK to build an n-gram index based on the comments in the database and used it to algorithmically generate original-ish comments. The results included gems such as the following:

Please file the following false assertion : The ONLY reason to allow ISPs to enhance and improve their service as a title II common carrier status – consumer gets the better solution , as a United states i have stated. Thanks for your consideration and please do the right choice. Net neutrality is critical to maintaining a free domain ! I would never need fear of throttling for their tiered traffic proposals. I ’d like to see some goofed up compartmentalization of the internet will firmly cement them in my home. I do not drag everyone to get an express
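
The generation code isn’t shown above, but the approach was roughly this. A minimal sketch, assuming a trigram index (build_index and generate_comment are my names; the comment strings come straight out of MongoDB):

import random
from collections import defaultdict

import nltk

def build_index(comments):
    # Map each word pair to the words observed to follow it
    index = defaultdict(list)
    for comment in comments:
        words = nltk.word_tokenize(comment)
        for w1, w2, w3 in nltk.trigrams(words):
            index[(w1, w2)].append(w3)
    return index

def generate_comment(index, seed, max_words=100):
    # Random-walk the index from a seed word pair
    w1, w2 = seed
    output = [w1, w2]
    for _ in range(max_words):
        followers = index.get((w1, w2))
        if not followers:
            break
        w1, w2 = w2, random.choice(followers)
        output.append(w2)
    return ' '.join(output)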

These insights, though limited in semantic clarity, did introduce me to A.L.I.C.E. and rudebot, which would be used in the final version of FCCliefs. I had conceded that I wouldn’t create something useful out of this project, so I tried a less serious idea: grab a random “if…” or “then…” clause from the comment database and match it with a “then…” or “if…” clause found in a random tweet, as sketched below. These results were more promising (see Fig. 1 again). They also made it clear that FCCliefs needed a profanity filter.
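
A sketch of that mashup logic, with illustrative patterns and names (not the production code):

import random
import re

# Illustrative clause patterns: grab from "if"/"then" to the end of the clause
IF_RE = re.compile(r'\bif\b[^.!?,]*', re.IGNORECASE)
THEN_RE = re.compile(r'\bthen\b[^.!?]*', re.IGNORECASE)

def random_clause(texts, pattern):
    # Collect every matching clause across the texts and pick one at random
    clauses = [m.group(0) for text in texts for m in pattern.finditer(text)]
    return random.choice(clauses) if clauses else None

def make_fcclief(comment_texts, tweet_texts):
    # Pair an "if..." clause from an FCC comment with a "then..." clause
    # from a random tweet (the sources can also be swapped)
    if_clause = random_clause(comment_texts, IF_RE)
    then_clause = random_clause(tweet_texts, THEN_RE)
    if if_clause and then_clause:
        return '%s, %s' % (if_clause, then_clause)
    return None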

I searched for a list of swear words and all their possible permutations. I used a regular expression to find matches and replace them with [Expletive]. With the introduction of censorship, FCCliefs was finally living up to its ancestry.
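
The filter boiled down to a single compiled alternation over the word list. A minimal sketch, with a stand-in list:

import re

# Stand-in list; the real list held the swears and all their permutations
SWEARS = ['darn', 'heck', 'dagnabbit']
SWEAR_RE = re.compile(r'\b(?:' + '|'.join(map(re.escape, SWEARS)) + r')\b',
                      re.IGNORECASE)

def censor(text):
    # Replace any match with the FCC-friendly placeholder
    return SWEAR_RE.sub('[Expletive]', text)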

All that was left was to bring it to life. After the extensive process of getting FCC comments into a database, it finally sunk in how, despite being public, the comments were still effectively invisible. This was when I realized that the bot could actually be a useful interface to the database, rather than an arbitrary parameter of my assignment (“So maybe you could make like, a… tech parody account for this? Something hype-related?”).

I developed the bot in Node.js for its simplicity, its speed of development, and my familiarity with it. There are great libraries for working with MongoDB, and it’s simple to manipulate JSON in JavaScript.

The way FCCliefs communicated with the outside world was by responding to users’ tweets at @FCCliefs in the format “Tell me about [subject]”. A regular expression grabs the subject, the server searches for matches in the database, and the script shuffles the matches to randomize them, then iterates through them until it finds a sentence containing the query that is short enough to fit in a tweet alongside a link to the original PDF. Finally, the sentence is run through the profanity filter and tweeted with that link.

//finds a tweet from the database containing the subject word
getSubjectTweet: function(subject) {
  var deferred = Q.defer();
  var self = this;
  var subjectRe = new RegExp(subject, 'i');
  comments.find({text: subjectRe}, function(e, docs) {
    if(docs.length > 0) {
      self.findTweetInDocs(docs, subjectRe).then(function(response) {
        deferred.resolve(response);
      });
    } else {
      deferred.resolve("I dont know about that. Try asking me about something else");
    }
  });
  return deferred.promise;
},

findTweetInDocs: function(docs, subjectRe) {
  var deferred = Q.defer();
  var sentenceRe = /[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)/g;
  docs = _.shuffle(docs).slice(0, 20);

  // Walk the shuffled docs until we find a short, on-topic sentence
  var found = docs.some(function(doc) {
    // Guard against texts with no sentence matches
    var sentences = doc.text.match(sentenceRe) || [];
    return sentences.some(function(sentence) {
      // 110 characters leaves room for the link to the original PDF
      if(sentence.length < 110 && subjectRe.test(sentence)) {
        deferred.resolve('"' + sentence + '" ' + doc.url);
        return true;
      }
      return false;
    });
  });

  if(!found) {
    deferred.resolve("I don't know about that. Try asking me about something else");
  }

  return deferred.promise;
}

(The code.)

Fig. 4. FCCliefs adhering to Godwin’s law.


If a user tweets something at the bot that is not “Tell me about…”, FCCliefs responds as if it were a chat bot. 85% of the time, Node.js calls a Python script using PyAIML (a Python interpreter for Artificial Intelligence Markup Language) that takes the tweet and responds as the chatbot A.L.I.C.E., a more advanced version of ELIZA, the original chat bot.

The other 15% of the time, Node.js calls a Python script that takes the tweet and responds as rudebot. This bot is built into the nltk.chat package, and responds with insults tame enough to pass through FCC censors unimpeded.

var Q = require('q');
var childProcess = require('child_process');

var chatBot = {

  generateResponse: function(query) {
    var deferred = Q.defer();

    // 85% of the time answer as A.L.I.C.E.; the other 15%, as rudebot
    var execString = Math.random() <= 0.85 ?
      'python alice.py ' + '"' + query + '"' :
      'python generate_rude_text.py ' + '"' + query + '"';

    childProcess.exec(execString, function(error, stdout, stderr) {
      deferred.resolve(stdout);
    });

    return deferred.promise;
  }

};

module.exports = chatBot;

(The code.)

import sys
import os.path
import aiml

k = aiml.Kernel()
k.verbose(False)

def load_brain():
  # Run once to compile the AIML files into a saved "brain" file
  files = os.listdir('./aiml/')
  for filename in files:  # don't shadow the aiml module with the loop variable
    if filename != "._DS_Store":
      k.learn('./aiml/' + filename)
  k.saveBrain('alice.brn')

def generate_alice_text(query):
  # Load the precompiled brain and print A.L.I.C.E.'s response to stdout
  k.loadBrain('alice.brn')
  response = k.respond(query)
  sys.stdout.write(response)

generate_alice_text(sys.argv[1])

(The code.)

import sys

import nltk.chat.rude

def generate_rude_text(query):
  # Respond as rudebot and print the insult to stdout
  response = nltk.chat.rude.rude_chatbot.respond(query)
  sys.stdout.write(response)

generate_rude_text(sys.argv[1])

(The code.)

FCCliefs also tweeted out a random match containing “FCC” and a random match containing “neutrality” at a set time interval, in case no one was talking to it. One pearl of wisdom I gathered from working as a news developer is that in order to stay relevant, you have to produce content. Though I don’t know if this is an alief or a belief, now it’s an fcclief too.
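
The production scheduler lived in the Node app, but the behavior amounts to a loop like this (sketched in Python for consistency with the other sketches; tweet_random_match is a hypothetical stand-in for the database search plus the Twitter client):

import time

def tweet_random_match(term):
    # Hypothetical helper: in the real bot this pulled a random stored
    # sentence containing the term and tweeted it with a link to the PDF
    print('would tweet a random comment sentence mentioning %r' % term)

INTERVAL = 60 * 60  # hypothetical interval

while True:
    tweet_random_match("FCC")
    tweet_random_match("neutrality")
    time.sleep(INTERVAL)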
