The Perils of Polling Twitter
Jake Harris on just a few of the myriad reasons why using tweets as data is less-than-ideal
“But what does Twitter think?”
If you are a data journalist in a newsroom, you will hear this question sooner or later in your career. It doesn’t really matter the context—I’ve heard it asked about everything from the Academy Awards to the Westminster Dog Show to presidential debates or the death of Osama bin Laden. And why not? It certainly sounds like a great idea. Twitter is such a conversational medium, it seems like an easy way to dip into the mindset of the world to see what they think. But as with all great ideas, it’s very easy to go wrong.
I know from experience. Time and time again, I have tried to create an interesting data visualization based on Twitter, only to have it fail for some reason or another. For example, I’ve tried harvesting tweets from the Academy Awards only to have the process explode under the weight of the Twitter streaming API (I don’t even want to talk about how badly my code crashed and burned when Osama bin Laden was killed). In other cases, I actually gathered the data I wanted only to find out the data itself was rubbish; this happened to me when I tried gathering tweets from within the Olympic village.
Looking back, Twitter visualizations usually fail because of one or more of these problems:
- The data I collected was limited or flawed in some way.
- The data does not answer the question I was trying to ask.
- The question I was trying to ask is really not that great.
To be fair, these aren’t problems specific to Twitter. Indeed, you can run into them for wildly different types of data. But, despite my love of Twitter, I do seem to mess up with it a fair amount. And I’m not the only one.
The Frustrations of Streaming
Let’s start with the issue of data collection. This is where I often wind up on the rocks. The promise of Twitter is that it offers several advanced APIs for dipping into the stream of all tweets:
- The filter API allows users to retrieve tweets that match keywords, are posted by specific users, or are geotagged within a specific area.
- The sampling API returns a representative sample of all tweets posted worldwide.
- Finally, the firehose API returns all tweets posted worldwide, but is only available to specific customers.
The streaming APIs certainly are powerful, but unless you spend a lot of time engineering software to use them, I’d recommend using a third-party provider like DataSift or Gnip instead. My own usage of them always ended in failure, usually because my software was unable to handle the rate of tweets coming at me or because I ran into rate limits imposed by Twitter for the filter API (making it impossible to accurately track tweet-per-minute volume). Because the filter API didn’t originally support searching on phrases, I also would wind up gathering many tweets that contained only one word of the phrase I was looking for, and that I would have to discard.
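The discard step described above can be done locally once the tweets are in hand. What follows is a minimal sketch, not the code I actually ran: the function name `contains_phrase` and the sample tweets are made up for illustration, and it assumes you only have the plain text of each tweet.

```python
import re

def contains_phrase(text: str, phrase: str) -> bool:
    """Return True if `text` contains the whole `phrase`, ignoring case."""
    words = [re.escape(word) for word in phrase.split()]
    if not words:
        return False
    # Words must appear in order, separated only by whitespace, and the
    # phrase must not be embedded inside a longer run of word characters.
    pattern = r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Illustrative tweets, not real data
tweets = [
    "Watching the Academy Awards tonight!",
    "My academy gave out awards for attendance",
    "academy   awards red carpet is wild",
]

matches = [t for t in tweets if contains_phrase(t, "Academy Awards")]
print(len(matches), "of", len(tweets), "tweets contain the full phrase")
```

This only cleans up what you managed to collect. It does nothing about tweets dropped by rate limits or missed while your collector was down.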
The biggest problem though is that I didn’t have a time machine. The streaming APIs only stream forward. If your code crashes for ten minutes or there was a keyword you forgot to specify, there is no way to retrieve those tweets you missed. It’s extremely frustrating, but especially so when you are trying to explain why the data is incomplete to a senior editor.
Which Public Is It?
Ultimately, every visualization based on Twitter data contains some ambiguity about what it is actually representing. For instance, take this early graphic about what people were tweeting about during the 2009 Super Bowl. Is this a graphic merely about Twitter, or is it a graphic about the US population’s thoughts during the Super Bowl as revealed by Twitter? In this case, it’s clearly the former, but in some cases the distinction is not so clear cut.
The problems with using Twitter as a model for the general population are simple. You don’t have to be a pollster to understand that searching for tweets that match some keywords hardly constitutes proper probabilistic sampling. We might display a map that shows colors mentioned by Americans on Twitter, but nobody would say this is an accurate map of favorite colors for each region of the USA. Naturally, most graphics play it safe and say overtly that they are only representations of Twitter and are not meant to provide deeper insight beyond that into the general population.
However, the distinction is lost on a lot of readers. I think many of us find these graphics so appealing because we see ourselves reflected in our data streams. There’s no harm in that when the subject is Super Bowl tweets or language variety in New York City. But it gets much more problematic when you are using Twitter as a polling mechanism. A recent Pew Research study found major divergence between public-interest polls and Twitter sentiment for recent political events; for instance, they reported up to 77% approval on Twitter for Obama’s reelection compared to 52% approval in a conventional poll. The problems here are twofold:
- Twitter’s demographics are unclear and may differ dramatically from the demographics of the general population.
- Poll questions may elicit responses from people who might have more nuanced opinions they wouldn’t necessarily express on Twitter. Polls will ask questions directly to their respondents; Twitter’s dynamic gives extreme voices on either end of the spectrum greater representation than they would have in a poll.
The sampling API isn’t the answer here. It might be useful if you wanted to select Twitter users from around the world to poll directly later, but why would you do that? It mainly serves as a good way of checking global Twitter trends like Twitter client usage rather than selecting populations for polls.
Some of the larger tweet-mining firms argued that the Pew study could’ve done some demographic analysis to correct for these issues, but few of us have the data or the experience to be able to do what they do. So, instead it’s easy to get burned.
A Map of Imaginary Places
Most visualizations aren’t global. Most people want to see a visualization of American tweets about the Super Bowl rather than the whole world’s reaction (although a visualization of expat tweets might be interesting too). Even for global topics like the Olympics or the death of Osama bin Laden, the ability to segment responses by geography could be a useful lens for the visualization. Twitter’s API seems to provide two rich opportunities for geographic exploration:
- Users can geocode their tweets at the level of geographic precision they are comfortable with from a latitude-longitude pair up to a city, state, or country.
- Search and streaming APIs provide the ability to only search for tweets contained within a specific area for better recall.
Sounds great, but don’t get your hopes up. It’s hard to find a recent precise figure, but several informal estimates I’ve seen report that only approximately 1-3% of tweets are geocoded. And of those, a sizable minority are automated tweets from services like Foursquare or Instagram. And so, if you are making a map of tweet content, you’re already dealing with a very limited selection of tweets.
The problem is there simply aren’t many geocoded tweets. Because of privacy concerns, users must explicitly enable geocoding for their accounts and also activate geocode for individual tweets. Almost all modern Twitter clients can do geocoding, so it’s no longer an issue of not having the ability.
To widen the net, you could try geocoding locations that users specify in their profiles. These won’t show up in a geographic search though (only for text searches / streaming), and you’ll have to be wary of false positives; be sure to look at the matching confidence returned by the geocoder. You’ll also have to allow for fuzziness when using geocoded locations. They will often specify cities and not be as precise as a single lat-long point coordinate. Sometimes, people will plot such fuzzy locations as points located at the centroid of the geographical area. Avoid this if you can; it creates a false clustering on the map and implies a precision that isn’t there.
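Here is one way that advice might look in code. This is a sketch under loud assumptions: `GeocodeResult`, its `confidence` and `precision` fields, and the 0.8 threshold are all hypothetical and do not correspond to any particular geocoder's response format. The idea is just to drop weak matches and keep city-level results out of the point layer.

```python
from dataclasses import dataclass

@dataclass
class GeocodeResult:
    """Hypothetical shape of a geocoder response for a profile location."""
    query: str         # the free-text location from the user's profile
    lat: float
    lon: float
    confidence: float  # assumed to range from 0.0 to 1.0
    precision: str     # assumed values: "point", "city", "region", "country"

MIN_CONFIDENCE = 0.8  # assumed cutoff; tune it against your own geocoder

def classify(result: GeocodeResult) -> str:
    """Decide how a geocoded profile location should be used on a map."""
    if result.confidence < MIN_CONFIDENCE:
        return "discard"            # likely a false positive
    if result.precision == "point":
        return "plot_as_point"      # precise enough to draw as a dot
    return "aggregate_by_area"      # shade the area instead of using a centroid

# Illustrative results, not real geocoder output
results = [
    GeocodeResult("Brooklyn, NY", 40.65, -73.95, 0.95, "city"),
    GeocodeResult("somewhere over the rainbow", 39.0, -98.0, 0.20, "country"),
    GeocodeResult("40.7580,-73.9855", 40.7580, -73.9855, 0.99, "point"),
]

for r in results:
    print(f"{r.query!r}: {classify(r)}")
```

Keeping the precision alongside each location means the map layer can shade a city or county rather than stacking dots on a centroid.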
In some circumstances, users may also change their Twitter locations as a form of protest. For instance, during Iran’s Green Revolution, many Twitter users thought changing their locations to Tehran would help to protect Iranian protesters. This was so successful it made it impossible for any news organizations to visualize tweets from Iran.
Looking at the bio location will increase the number of geocoded tweets you might retain. But those tweets will not be returned by geographic queries against search or streaming. This means you will likely need to do text searches and then narrow down based on whether the matches include a location or can be geocoded to one.
If you are doing a map of tweets, you have to accept that it will be limited already; it’s a bit like mapping an iceberg by only drawing the portion above the water. If you are doing a map of tweets that contain hate speech and are geocoded, the problem becomes even worse. This is what a research group at Humboldt State University recently tried to do. They released a contentious map illustrating where hate speech was tweeted in America. The map creators rightly touted that they hand-checked every tweet to avoid miscounting situations where subcultures embrace the slurs used against them to weaken their power. But geography still created its own issues.
Over the course of 10 months, the authors collected 150,000 hateful tweets from their analysis. This might seem like a lot, but considering that Twitter users now generate 400 million tweets a day, it’s clear that the intersection of hate speech and geocoding within America is a minuscule subset of Twitter. Another problem with mapping geocoded data is that you often are just mapping population density. To counteract this, the makers of the hate tweet map aggregated geocoding at the county level and then divided by the total number of geocoded tweets they had for that county. This meant that hotspots are relative to the local rather than the national population, but it also meant that it would take relatively few tweets to color the map in some areas. And so, the Quad Cities area of Iowa shows up as a racist hotspot because of 41 tweets.
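To make the scale and the small-denominator problem concrete, here is a rough sketch. The first part uses only the figures above, assuming a constant 400 million tweets a day and 30-day months. The county names, the counts (apart from the 41 flagged tweets), and the `MIN_GEOCODED` threshold are invented for illustration. They are not the map's actual data or method.

```python
# Rough scale check, assuming a constant 400 million tweets per day
# and 30-day months over the 10-month collection period.
DAYS = 10 * 30
total_tweets = 400_000_000 * DAYS
share = 150_000 / total_tweets
print(f"Collected tweets as a share of all tweets: {share:.7%}")

# Made-up county counts to show how per-county normalization behaves.
counties = {
    "Large County": {"flagged": 400, "geocoded": 200_000},
    "Small County": {"flagged": 41, "geocoded": 2_000},
}

MIN_GEOCODED = 10_000  # assumed threshold below which a rate is flagged as shaky

def flagged_rate(counts: dict) -> float:
    """Flagged tweets divided by all geocoded tweets for one county."""
    return counts["flagged"] / counts["geocoded"]

for name, counts in counties.items():
    rate = flagged_rate(counts)
    note = "" if counts["geocoded"] >= MIN_GEOCODED else " (small sample)"
    print(f"{name}: {rate:.2%}{note}")
```

In this toy data the small county's rate is about ten times the large county's, even though it has roughly a tenth as many flagged tweets. Marking or suppressing counties below a minimum count is one common way to avoid overreading such numbers.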
So what? This is just a map about Twitter, right? Yet, reading the comments on the piece, it’s clear that many readers are viewing this as a map not just of racist tweets but of racist attitudes in the US. It’s a reasonable assumption on their part. Why would anyone make a visualization that didn’t represent something more meaningful than simply what is being tweeted? This might not be a concern to the visualization’s creators, but it’s something that bothers me. The Guardian hedges their bets with the disclaimer:
The only question that remains is whether the views of US Twitter users can be a reliable indication of the views of US citizens.
But many readers don’t seem to have that concern. The map certainly seems to reflect common assumptions—rural areas are worse, the South has more problems than New England—and yet it’s impossible to say. The inverse scaling they use to not drown out rural counties might be exaggerating the problem there, and just because it matches what we expect doesn’t mean we aren’t suffering from confirmation bias. Such questions are ones you should explore if you want to run a map based on Twitter.
By now, my concerns about polling Twitter for insight should be pretty apparent, but it seems like it’s a problem we’ll want to start solving. There are two possible areas of exploration that might be compelling.
For starters, there is the question of demographics. It’s clear that Twitter differs from the general population in sometimes extreme ways depending on what you are looking for. I am not a statistician, but I’m sure there must be statistical means for responsibly working with such data and assessing levels of confidence. Some guidelines and maybe even statistical libraries would be very helpful for people like me. In addition, there are no regular reports about basic Twitter statistics and demographics to reference. Just a regularly updated statistical digest that includes how many tweets happen in the US on a given day might illuminate how small a subset of all tweets you are working with is.
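One standard technique of this kind is post-stratification weighting, where each response is reweighted so that the sample's demographic mix matches the population's. The sketch below only illustrates the arithmetic. The group labels, population shares, and responses are all invented, and it assumes you know each user's demographic group, which Twitter does not tell you.

```python
from collections import Counter

# Assumed population shares for two hypothetical groups
population_share = {"under_30": 0.2, "30_plus": 0.8}

# Invented sample of (group, approves) pairs, skewed toward younger users
sample = (
    [("under_30", True)] * 6 + [("under_30", False)] * 2
    + [("30_plus", True)] * 1 + [("30_plus", False)] * 1
)

# Share of each group in the sample
group_counts = Counter(group for group, _ in sample)
sample_share = {g: n / len(sample) for g, n in group_counts.items()}

# Weight = population share / sample share, so overrepresented groups count less
weights = {g: population_share[g] / sample_share[g] for g in sample_share}

raw_approval = sum(approves for _, approves in sample) / len(sample)
weighted_approval = (
    sum(weights[g] * approves for g, approves in sample)
    / sum(weights[g] for g, _ in sample)
)

print(f"Raw approval: {raw_approval:.0%}")
print(f"Weighted approval: {weighted_approval:.0%}")
```

Weighting can only rebalance groups that are present in the sample. It cannot recover people who are not on Twitter at all, or who are on it but never tweet about the topic.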
The other gauntlet is about visualization itself with respect to Twitter. There really haven’t been too many advances in visualizing Twitter since its inception. We either put tweets on a map or we show a chart of “total tweets per minute” that match some criteria. Neither of these approaches feels that exciting to me personally. And as I’ve explained, they can be deceptive and misleading. Are there new ways we can visualize Twitter that would help us make useful, valid interpretations of the data we find in tweet language and tweet patterns?
Jacob Harris is a Senior Software Architect who works with a kickass team of fellow newsroom developers at the New York Times.