They Are Tweet Zombies!! They Are Followers!!

A follower bot on Twitter

Jake Harris on how dead accounts and spambots can mess with your Twitter data mojo

On Twitter, as in the world at large, the dead vastly outnumber the living.

By the dead, I don’t mean users who are actually deceased, but the many inactive and inhuman accounts on Twitter. Nobody ever truly dies on Twitter, of course. Despite a policy stating that accounts will be terminated after six months of inactivity, Twitter has yet to purge abandoned accounts. As a result, although Twitter’s total number of users has steadily climbed past 500 million, only about 25% of them use the site in a given month. This means that some portion of your followers, perhaps a large one, is likely inactive.

The dead are mostly harmless (unless they have a short username that you want), but the spambots are a different matter.

The Long Night of the Living Dead Spambots

Anybody who watches zombie movies or TV shows knows what a nuisance the undead can be. Even the quietest moments will be interrupted by some lurching horror stumbling out of the woods. And once one zombie arrives, many more will hear the noise and follow. This of course is exactly how the worst spambots on Twitter operate, although thankfully for the living, they are only after your attention rather than your brain.

So, first of all, when you’re planning an interactive that gathers and displays tweets, you’ll need to be prepared for attention-eating spambots that try to ruin the fun.

There is also an insidious type of spambot that doesn’t seek out humans at all. Unlike their aggressive cousins, followerbots try to blend in among regular humans. In horror movie terms, they are less like zombies and more like “pod people”, and they only exist because there are people in the world foolish enough to think that follower counts matter and are willing to pay to boost theirs. Some of these bots may also have more clandestine purposes than merely inflating follower counts, sleeping until they are activated or subtly working to influence aggregate metrics like retweets or trending topics. They won’t hassle you, but they may mess up your numbers if you are surveying metrics like followers on Twitter. Unlike simple spambots, they can be hard to identify as robots, since they do their best to look human.

Let’s look at a few situations where bots can be pests to data visualizers—unless you correct for them—and some ways to compensate for their effects on your data.

There are two major types of graphics based on Twitter data these days: tweets plotted on a map, or a chart of tweets-per-minute that demonstrates the intensity of online conversation about a news topic or event. This chart from Twitter showing tweet volume during the 2013 State of the Union is a typical example of the form: a count of tweets per minute with spikes annotated by the likely soundbites causing them.

A tweet-per-minute chart.

These graphics are usually created by matching specific keywords and hashtags like “obama” or “#sotu13.” To associate a spike with a specific event, the chart just notes the time and makes a best guess at what event a few minutes earlier was likely to have caused it. This approach seems reliable enough. Yet high tweet volume also feeds into Twitter’s place trends, a search tool that shows the most popular terms worldwide and in specific cities. Newsworthy events will often hit the trending lists in many locations worldwide. Many spambots look for trending hashtags so they can tweet to them and show up in Twitter searches. Then things like this happen.

Spambots invade #NICAR

The image above was taken from a hashtag search during the National Institute for Computer-Assisted Reporting conference in 2013. It’s a conference for data journalists. There was a lot of tweeting, and we briefly cracked the trending topics list. And then the spambots arrived en masse. If you were charting tweets-per-minute for #NICAR13, you would see a spike there and might assume it was caused by a particular panel or event. You can imagine similar invasions of other trending topics.

Incidentally, spambots can also be a nuisance if you are asking people to tweet with a specific hashtag like #askNYT to gather feedback from users. If you don’t moderate those tweets, you are asking for trouble in the form of spambots or pranksters.

Admittedly, the #NICAR13 surge was for a small conference for data journalists. I don’t know if there are significant spambot-amplified spikes on that chart for the State of the Union or other news events. It could be that there are enough human tweets that the effect of the spambots is mere fuzz on the underlying trendlines. But it bothers me that I don’t know exactly how big an effect the bots might have here. And I can imagine smaller situations like the NICAR one where the bot surge might drive false conclusions about Twitter reactions. You might say, so what? These are pretty silly charts as it is. But I want them to be as accurate in their silliness as they can be.
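One way to see how much of a spike comes from bots is to count tweets per minute twice: once for everything that matches the hashtag, and once with suspected spam accounts left out. The sketch below is only an illustration. The tweet dictionaries, the account names, and the `exclude_users` parameter are all made up for the example, and it assumes you already have some list of accounts you consider spam.

```python
from collections import Counter
from datetime import datetime


def tweets_per_minute(tweets, keywords, exclude_users=frozenset()):
    """Count keyword-matching tweets per minute, skipping excluded accounts."""
    counts = Counter()
    for tweet in tweets:
        if tweet["user"] in exclude_users:
            continue
        text = tweet["text"].lower()
        if any(keyword in text for keyword in keywords):
            # Truncate the timestamp to the minute it falls in.
            minute = tweet["created_at"].replace(second=0, microsecond=0)
            counts[minute] += 1
    return counts


# Made-up sample data, for illustration only.
sample = [
    {"user": "alice", "text": "Great panel at #NICAR13",
     "created_at": datetime(2013, 3, 1, 10, 0, 5)},
    {"user": "bot1", "text": "Cheap followers!! #nicar13",
     "created_at": datetime(2013, 3, 1, 10, 0, 40)},
    {"user": "bot2", "text": "Win a prize #NICAR13",
     "created_at": datetime(2013, 3, 1, 10, 0, 55)},
    {"user": "bob", "text": "Slides are up #nicar13",
     "created_at": datetime(2013, 3, 1, 10, 1, 10)},
]

raw = tweets_per_minute(sample, ["#nicar13"])
cleaned = tweets_per_minute(sample, ["#nicar13"], exclude_users={"bot1", "bot2"})
```

In this toy data the 10:00 minute holds three tweets in `raw` but only one in `cleaned`, so the apparent spike is entirely bot traffic. Real data will be messier, and the result is only as good as the list of excluded accounts.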

Followers

Let me just be blunt here: follower counts are largely meaningless on Twitter, yet some people still cling to them as an important metric of influence and popularity. That’s mainly because it’s one of the few we have. And yet, it is such a terrible metric. For instance, imagine we want to compare the social media savvy of several news brands on Twitter by looking at their follower counts. That comparison would be misleading for several key reasons:

  • As I mentioned above, Twitter accounts aren’t deactivated when they’re dormant, so followers will just accrue to news sites because users sign up to Twitter, follow some news sites, then stop using Twitter. So, some of the difference between follower counts might simply be due to one account being older than another.
  • Followerbots will often pick news accounts to follow and retweet. It sometimes helps to mask the occasional marketing message they mix in, or makes it seem like they are active people rather than spambots. They are followers, but they aren’t actively picking one news account over another for its quality.
  • Twitter itself has altered the dynamic by suggesting certain accounts for new users to follow. Much as I would like to believe the @nytimes account is the dominant leader in followers due to our social-media savvy, a large part of that is due to Twitter’s arbitrary decision to put that account on the Suggested User List. Just another among many reasons why follower count is a meaningless metric.

All of which is to say, follower counts are often wrongly used as a proxy metric for an account’s influence. But they are cited nevertheless. In the 2012 primaries, the number of followers of Newt Gingrich briefly became a campaign issue, with a disgruntled former staffer accusing Gingrich of purchasing his followers. So, we decided to look into it. To do that with the Twitter API today would mean calling two API methods:

  1. followers/ids, which returns up to 5,000 user IDs per request (up to 15 requests in 15 minutes)
  2. users/lookup, which offers look-up information for 100 users at a time (up to 60 requests in 15 minutes)

Newt Gingrich has 1.4 million or so followers. Using the API, it would take 280 requests to the followers/ids method, plus 50 users/lookup requests for each page of 5,000 IDs returned (14,000 lookup requests in total). The challenge is spacing them out so that you don’t exceed the rate limits and get blocked for a while. At the limits above, the follower pages alone take more than four hours and the lookups more than two days. But, once you wait that out…

The problem was that there was nothing to definitively indicate that Gingrich had purchased followers. Many of them were dormant or spambots, but Newt had also spent a long time on the Suggested User List, so that was likely a key factor. A sudden increase in followers on a single day might be a red flag, but Twitter’s API does not provide the time a user followed an account. If we had been tracking Newt Gingrich’s followers from the beginning, we might’ve been able to identify such a surge, but without that we couldn’t say.
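The request arithmetic for a follower crawl like the one above can be checked with a short sketch. It uses only the page sizes and rate limits quoted in the list, and it assumes the two endpoints are limited separately and that a full window of requests can be sent the moment each 15-minute window opens. The constant and function names are made up for the example.

```python
import math

FOLLOWERS = 1_400_000     # approximate figure from the text
IDS_PER_REQUEST = 5000    # followers/ids page size
USERS_PER_LOOKUP = 100    # users/lookup batch size
WINDOW_MINUTES = 15       # length of one rate-limit window
IDS_LIMIT = 15            # followers/ids requests allowed per window
LOOKUP_LIMIT = 60         # users/lookup requests allowed per window


def min_wait_hours(requests, per_window):
    """Hours spent waiting if each window's quota is used as soon as it opens."""
    windows = math.ceil(requests / per_window)
    # The first window needs no waiting; every later one costs 15 minutes.
    return (windows - 1) * WINDOW_MINUTES / 60


ids_requests = math.ceil(FOLLOWERS / IDS_PER_REQUEST)
lookup_requests = math.ceil(FOLLOWERS / USERS_PER_LOOKUP)

ids_hours = min_wait_hours(ids_requests, IDS_LIMIT)
lookup_hours = min_wait_hours(lookup_requests, LOOKUP_LIMIT)
```

Under those assumptions this gives 280 follower-page requests (4.5 hours of waiting) and 14,000 lookup requests (58.25 hours of waiting), so the profile lookups, not the follower pages, are the bottleneck.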

The Wheat and the Chaff

Okay, so many Twitter accounts are either dormant or spambots. So what? It might not matter too much if you’re just doing silly charts about Twitter, but the pedantic side of me is pained to read that there were 1.36 million tweets about the State of the Union, and I know that at least some portion of those were not from real people. It’s an annoying problem if you value accuracy—and you should! The problem of fake accounts will plague a variety of visualizations. It seems like there are two ways we can improve this situation.

There are a few commercial companies like StatusPeople and SocialBakers that offer services for reporting fake followers for your account. In general, these services use a few defined heuristics against the user profile of each of your followers to assess if they are fake. Because of the limits of the Twitter API, they won’t analyze accounts with many followers and their analysis is limited to profile information and not follower tweets. Surprisingly, neither of them seems to offer a service for looking up whether a specific Twitter account is a bot.

There may be some more interesting approaches to be explored. To fight email spam, many organizations use and publish DNS blacklists of known offending domains. It seems possible that the identification of spam accounts could be distributed in some way or run as a service. I also wonder if machine learning might be enlisted to track new trends in bot accounts (for instance, there seem to be a lot of them with “bacon lover” in their profiles these days). The firehose seems like a useful way to identify spam accounts as they tweet rather than later after the fact. I would love also to get a measure of how many times a given account has been reported as spam, although that could be exploited as a denial-of-service attack.

No approach will be perfect. There will be false positives, where users are misidentified as spammers, and false negatives, where spammers evade the net. Just knowing the extent of the problem for a given use case would be useful enough. In the same way that polls have margins of error, we should probably consider developing something similar for Twitter visualizations.
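As a rough sketch of what a margin of error for a Twitter count might look like, suppose you draw a random sample of the tweets behind a chart and check each one by hand. The share of bots in the sample, with a standard 95% confidence interval for a proportion, gives an estimate for the whole set. The function name and the sample numbers below are made up, and the normal approximation it uses assumes a reasonably large random sample.

```python
import math


def bot_share_interval(sample_size, flagged, z=1.96):
    """Estimate the bot share and its margin of error from a labeled sample.

    Uses the normal approximation for a proportion; z=1.96 gives roughly
    a 95% confidence level.
    """
    share = flagged / sample_size
    margin = z * math.sqrt(share * (1 - share) / sample_size)
    return share, margin


# Hypothetical example: 60 of 400 hand-checked tweets came from bots.
share, margin = bot_share_interval(400, 60)
```

With these invented numbers the estimate is a bot share of 15%, plus or minus about 3.5 percentage points. The arithmetic is the easy part. The harder part is labeling the sample accurately, since any mistakes there carry straight into the estimate.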

About Jacob Harris
