Gender, Twitter, and the Value of Taking Things Apart

Jake Harris reverse-engineers Twee-Q to evaluate its use of data (and see if his ratio is as disappointing as Twee-Q says it is)

Jake Harris’s Twee-Q score

It’s impossible to deny there is serious gender inequality in the world of journalism. Similarly, the world of computer programming is marked by an even more serious skew in its gender balance. Data journalism is firmly seated at the intersection of those two fields. Where does that leave us? In a state of eternal vigilance about gender balance and diversity, as it should be.

This isn’t an essay about how to address gender inequality in digital newsrooms. I don’t claim to know the answers and I expect others can provide better guidance than I can. For me, one essential first step is the simple act of making people aware that gender imbalances are real and have consequences for who is heard. Some developers in Sweden created Twee-Q to raise awareness of the gender problem among Twitter users. The interface is pretty simple. You submit a Twitter account for analysis (yours or possibly someone else’s) and it scans that account’s most recent 100 tweets for all its retweets. It then checks the names of the retweeted accounts to see which are male or female (ambiguous names are ignored). Once it has a count of male and female retweets among the last 100 tweets from your account, it can tell you how much you unconsciously prefer retweeting one gender over the other, in the hope that you’ll try to rebalance who you follow and be aware of whether you are favoring one gender’s perspective on the world.

Naturally, I tried it out on myself. Oof. The screenshot above shows how dismal my own score was. I now knew I could do better. But I also immediately wanted to figure out how Twee-Q worked. Of course, this was somewhat in the hope I could identify some fault in their approach to blame for my own poor performance. But also I was interested in rebuilding it because it makes the same choices any other programmatic analysis of Twitter does: finding a balance between speed and accuracy. This project is an object lesson in the thrills and pitfalls of using tweets as data—and in the value of reverse engineering the creation of data as a way of evaluating its validity.

The Tweequality Application

That is ultimately why I built my own version: so I could understand their design decisions and explore whether they affect the final analysis.

As someone who is slightly obsessed with Twitter data, I know a lot about this. Twitter has always been notable for its extensive API access, but the service’s growth has also necessitated that the company place limits. In the early days, scripts were only required to log in if they were performing “write” actions like posting a tweet or following an account; now some form of authentication (blanket credentials for an application or user-granted permission for applications to view their accounts) is required for every endpoint in the API, with varying rate limits on how much any given method can be queried in a 15-minute window. These limits can slow things down immensely for apps. For instance, Twitter’s new analytics portal allows you to see the gender breakdown of your followers (for me, it reports 72% of my followers are male, 28% are female), but what if you wanted to calculate something like that for a large account like @nytimes? The bulk followers method lets you retrieve 5000 user IDs every minute. To retrieve all 12.9 million followers of the @nytimes account would take 2580 minutes, or about 1.8 days. And this method only returns user IDs. To actually retrieve information about each user requires a different method that allows up to 400 user records to be downloaded per minute, which would mean waiting 22.4 days to calculate the gender breakdown of that account on your own. Worse still, some API methods contain hard limits. For instance, the user_timeline method can return a maximum of the 3200 most recent tweets for a given user. To give some context, I have posted over 85,000 tweets in my life, meaning at most 3.7% of my entire timeline can ever be analyzed by programs, and I keep reducing that proportion with every tweet I write.
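A quick sketch of that arithmetic in Ruby, using only the figures quoted above. The method and variable names are made up for illustration; nothing here calls the Twitter API.

```ruby
# Minutes needed to page through `total` items at `per_minute` items a minute.
def minutes_needed(total, per_minute)
  (total.to_f / per_minute).ceil
end

# Convert minutes to days, rounded to one decimal place.
def days(minutes)
  (minutes / 1440.0).round(1)
end

followers        = 12_900_000 # @nytimes followers, as quoted above
ids_per_minute   = 5_000      # follower IDs per minute, as quoted above
users_per_minute = 400        # full user records per minute, as quoted above

id_minutes   = minutes_needed(followers, ids_per_minute)
user_minutes = minutes_needed(followers, users_per_minute)

puts "IDs only:     #{id_minutes} minutes, about #{days(id_minutes)} days"
puts "Full records: #{user_minutes} minutes, about #{days(user_minutes)} days"
```

Swapping in a smaller follower count shows how quickly the wait shrinks for ordinary accounts, which is the case most apps are designed around.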

Admittedly, @nytimes is an extreme outlier, but forcing even a few minutes’ wait for an analysis makes your apps more complex and people less likely to stick around. Furthermore, the only viable means for most applications to use the API is to request that users log in and grant them permission to read and sometimes write to their accounts. This process is somewhat convoluted, and many users might balk at the screen asking if an application can have permission to look at their account. This is why most applications choose speed over accuracy, often looking at only the most recent tweets or followers for any given account. In the case of Twee-Q, they bypass any form of application or user authentication at all and look only at the most recent 100 tweets for any account to make their calculation.

How accurate can that be? It’s a bit like owning a fitness tracker that only remembers a single day or even only a few hours. Looking at only the most recent window of events is not necessarily wrong, but it’s also not exactly the same as using a complete data set or a randomized sample. Admittedly, Twee-Q is more of a toy (albeit a toy with a social message) than a news application like we normally cover here at Source. But of course many news organizations have built similar widgets of their own. I was curious what it would be like if those API limits were different, so I built my own personal, local Twitter API sandbox by downloading my complete tweet archive.

Revisiting the Past

I was able to do this thanks to a feature Twitter rolled out in the past year, the ability to download and browse your own complete archive of tweets. They provide the archive in two main formats for consumption: a basic CSV of all your tweets and a dynamic interface you can open locally in your web browser. The beauty of the latter is that it loads all of its data from a separate directory of JSON data files organized by month and year. In addition, Twitter has put some effort into identifying retweets (both automated and manual) and flags them in the data by transcluding information about the original tweet (or status in Twitter’s jargon) into the JSON for the retweet message like this:

"retweeted_status" : {
    "source" : "\u003Ca href=\"http:\/\/www.apple.com\" rel=\"nofollow\"\u003EiOS\u003C\/a\u003E",
    "entities" : { ... },
    "geo" : { },
    "id_str" : "469866710192128000",
    "text" : "Andreessen bot responds to journalism job postings with \"this should be a bot.\" http:\/\/t.co\/Sf6N5NVf1l",
    "id" : 469866710192128000,
    "created_at" : "2014-05-23 15:45:28 +0000",
    "user" : {
      "name" : "Lois Beckett",
      "screen_name" : "loisbeckett",
      "protected" : false,
      "id_str" : "21134925",
      "profile_image_url_https" : "https:\/\/pbs.twimg.com\/profile_images\/2187277560\/loispp_normal.jpg",
      "id" : 21134925,
      "verified" : true
    }
}

All of which makes it pretty easy to identify which tweets in your timeline are retweets and the original users who wrote them. So, that is basically how my version of the Twee-Q algorithm works. By running it against my entire archive, I can explore three questions I had about the Twee-Q algorithm:

  1. How accurate is it to guess the genders of Twitter accounts anyway? How much do bots and brands interfere with the process?
  2. What difference would it make in the final calculation if the Twee-Q algorithm was able to look back at more tweets than 100? Would a slightly larger sample have a bigger effect?
  3. Just how reliable can a single measurement on 100 tweets be? Does the ratio stay pretty consistent or vary wildly from day to day?

To answer these, I wrote two scripts that work once you download your tweet archive from Twitter and save it to a subdirectory in the project:

  1. analyze_gender.rb: runs through the archive guessing the genders of every account retweeted. These guesses are saved to a separate CSV file in the tweets directory with a second column that allows you to correct any miscategorized accounts.
  2. analyze_retweets.rb: runs through the archive analyzing the timeline. It first tallies the gender miscategorizations recorded in the CSV file. Then it analyzes all tweets before outputting several CSV files to help answer questions 2 and 3.
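To make the retweet-detection step concrete, here is a minimal sketch, not the actual scripts. It assumes each archive file can be read as a JSON array of tweet objects shaped like the excerpt above; if the files wrap that array in anything else, that wrapper would need to be stripped first. The method name and the sample data are invented for illustration.

```ruby
require 'json'

# Return the original author of every retweet found in one archive file.
# Assumption: `json_text` is a JSON array of tweets, and a retweet carries
# a "retweeted_status" object with a nested "user", as in the excerpt above.
def retweeted_users(json_text)
  JSON.parse(json_text).filter_map do |tweet|
    original = tweet['retweeted_status']
    next unless original # ordinary tweets have no retweeted_status

    user = original['user']
    { screen_name: user['screen_name'], name: user['name'] }
  end
end

# Made-up sample: one retweet and one ordinary tweet.
sample = <<~TWEETS
  [
    { "text": "a retweet",
      "retweeted_status": { "user": { "name": "Lois Beckett", "screen_name": "loisbeckett" } } },
    { "text": "an ordinary tweet" }
  ]
TWEETS

users  = retweeted_users(sample)
counts = users.map { |u| u[:screen_name] }.tally # retweets per account
puts counts
```

The `name` field is what a gender guesser would look at; the `screen_name` is only used here as a stable key for counting.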

The Problems with Guessing Gender

One of the appeals of Twitter is that you don’t need to share much about yourself to start talking. Twitter doesn’t require users to reveal much of their personal identity, providing only a few sparse fields – a name, a short bio, location – that can be set to anything (or nothing) that users might choose. This is in such stark contrast to Facebook’s ethos of capturing every possible connection between its users that it might seem ludicrous to use Twitter as a basis for any demographic study at all. However, Twitter’s focus on public conversations instead of private connections is what makes it irresistible for people studying topics like political speech, hate speech, breaking news and global events. All of these case studies involve inferring some sort of demographic detail from the meager data provided by users. How well can that work?

Gender seems like it would be an easy thing to infer at first glance. Unlike political orientation, for instance, it doesn’t require looking at a user’s tweets or connections; it’s as simple as comparing the name on the account to lists of known male and female names and guessing based on that. Twee-Q used lists from a few national censuses. I don’t have their exact list, but there is a Ruby gem named sex_machine that provides the same functionality and also meets the Ruby world’s penchant for picking wildly immature names for software libraries. Given a name, Sex Machine makes a guess whether the name is male, mostly male, female, mostly female or unclear. Unclear cases could be things like brands (can you really gender The New York Times?) as well as names used relatively equally by both genders (like Courtney or Lindsey).
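The list-lookup idea can be sketched in a few lines. This is a deliberately tiny stand-in, not the sex_machine gem or Twee-Q’s code: the name lists are invented, and it collapses the “mostly male” and “mostly female” categories into a simple three-way answer.

```ruby
# Guess a gender from the first word of a display name.
# Returns :male, :female, or :unknown (on neither list, or on both).
def guess_gender(display_name, male_names, female_names)
  first  = display_name.to_s.strip.split(/\s+/).first.to_s.downcase
  male   = male_names.include?(first)
  female = female_names.include?(first)
  return :unknown if male == female # no evidence, or conflicting evidence

  male ? :male : :female
end

# Hypothetical lists; a real tool would load census-sized ones.
male_names   = %w[jake john david courtney]
female_names = %w[lois mary sarah courtney]

puts guess_gender('Lois Beckett', male_names, female_names)       # female
puts guess_gender('The New York Times', male_names, female_names) # unknown
puts guess_gender('Courtney Smith', male_names, female_names)     # unknown
```

Even this toy version shows where the trouble starts: anything that is not a first name on a list, whether a brand, a handle, or a name from a culture the lists don’t cover, falls through to unknown.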

The analyze_gender.rb script runs the sex_machine gem on all the retweets in your archive and then dumps a CSV that can be hand-checked and corrected with the actual gender of all accounts. In the case of my own Twitter history, I was surprised to see that my gender analyzer guessed wrong for around 15% of the names it encountered. Here is the detailed breakdown of error types.

Guessed   Actual   Count
Female    Male        20
Female    None        43
Male      Female      17
Male      None        12
None      Female     228
None      Male       341
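For anyone who wants to check the proportions, the counts in the table above can be tallied with a short sketch like this one. The grouping into “no guess” and “swapped” is my own labeling for illustration.

```ruby
# Error counts copied from the table above: [guessed, actual] => count.
errors = {
  %w[Female Male]  => 20,
  %w[Female None]  => 43,
  %w[Male Female]  => 17,
  %w[Male None]    => 12,
  %w[None Female]  => 228,
  %w[None Male]    => 341
}

total = errors.values.sum

# The classifier gave up on an account that did have a gender.
no_guess = errors.sum { |(guessed, _actual), count| guessed == 'None' ? count : 0 }

# The classifier confused male for female or vice versa.
swapped = errors.sum do |(guessed, actual), count|
  guessed != 'None' && actual != 'None' ? count : 0
end

puts "total errors: #{total}"
puts "no guess:     #{no_guess} (#{(100.0 * no_guess / total).round(1)}%)"
puts "swapped:      #{swapped} (#{(100.0 * swapped / total).round(1)}%)"
```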

It’s clear that the vast majority of those mistakes happened when the classifier was unable to guess the gender of an account rather than misidentifying a male as a female or vice versa. I noticed a few reasons why this would happen:

  • Although I tended to mostly retweet Western-style names, the gender analyzer was generally flummoxed by other types of names.
  • A sizeable number of accounts did not provide a real name at all. In these cases, I guessed the actual gender by looking at the avatar photo, but that process is obviously error-prone.
  • Of those accounts that did not provide a name, the bulk of them simply repeated their Twitter username in the name field. The extent of this practice surprised me.
  • In a quirk that is possibly specific to the sex_machine gem, all accounts that started with “The” like “The New York Times” were misidentified as female names. Not enough to distort the errors significantly but it does show one way this process can be thrown off.

Admittedly, these observations are specific to my tweeting patterns, but the number of gender misidentifications was far higher than I expected. Moreover, the majority of errors were actually failures to guess any gender, not my script confusing male for female or vice versa. Assuming my retweets are a representative sample of Twitter as a whole, a sizeable number of Twitter users obscure their online identities in some basic fashion. In most of these cases, a simple glance at the avatar reveals the user’s gender (provided it’s actually the user’s photo), and that’s not an accident. Given the aggressive nature of the spambots that plague Twitter, it makes sense to be a little coy with fields a machine can parse while still being upfront with photos a human can understand. And of course, sometimes a user’s identity is entirely fabricated. I doubt that @subtweetcat is an actual cat, for instance.

Ambiguity is unavoidable. This doesn’t make Twitter research meaningless. It just means that researchers should accept and be upfront with their readers that there will always be some level of fuzziness in their analysis. Twitter’s own analytics tool presents an illusory certainty about the gender breakdown of my followers, with no explanation of how they handle unclear cases. Reality is much more complicated.

A Window into My Soul

How much does the limited window used by Twee-Q affect its results? Putting things formally, the goal of any analysis like Twee-Q is to infer a specific characteristic about my entire Twitter usage from only looking at a small window on my timeline. The analyze_retweets.rb script explores the question of accuracy informally by chugging through my timeline in reverse chronological order (i.e., how any bot would) and outputting two CSV files that show:

  1. What are the effects of using an ever-larger window? This CSV recalculates the male/female retweet ratio at ever-expanding windows back from the most recent tweet to see how it changes.
  2. How variable is any given 100-tweet window? This CSV recalculates the male/female ratio for a sliding window of 100 tweets starting with each prior day in the timeline to see how wildly the score varies from one day to the next.
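The two calculations in the list above can be sketched as follows. This is a simplified illustration rather than the actual script: the timeline is a made-up array, newest tweet first, in which each entry is `:male`, `:female`, or `nil` (not a retweet, or gender unclear), and the sliding window moves one tweet at a time instead of one day at a time.

```ruby
# Fraction of gendered retweets in `tweets` that are male, or nil if none.
def male_share(tweets)
  male   = tweets.count(:male)
  female = tweets.count(:female)
  return nil if (male + female).zero?

  male.to_f / (male + female)
end

# Score for ever-larger windows reaching back from the most recent tweet.
def expanding_windows(tweets, step)
  (step..tweets.size).step(step).map { |n| [n, male_share(tweets.first(n))] }
end

# Score for a fixed-size window starting at each position in the timeline.
def sliding_windows(tweets, size)
  tweets.each_cons(size).map { |window| male_share(window) }
end

# Made-up timeline, newest first.
timeline = [:female, :male, nil, :female, :male, :male, nil, :male]

expanding = expanding_windows(timeline, 4)
sliding   = sliding_windows(timeline, 4)

expanding.each { |n, share| puts "last #{n} tweets: #{share.round(3)}" }
puts "sliding scores: #{sliding.map { |s| s&.round(3) }.inspect}"
```

Note that neighboring sliding windows share most of their tweets, so their scores move together; that overlap matters for the statistics discussed further on.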

I wish I had a solid enough understanding of statistics to provide exact answers to either of these questions. Instead, I can only observe the hallowed digital journalism tradition of presenting several charts with commentary. For starters, here is the adjusted average as the window looking back increases. The Pct value represents the fraction of the accounts I have retweeted that are male rather than female. Accounts with no specific gender are not included.

Line graph

It is no surprise that the computed percentage eventually converges to a value. In this chart, the average swings a bit wildly at first when the window is small before steadily climbing upward from 0.65 to a final value of 0.6992; i.e., as I go further back in time, the percentage of my retweets that are from men increases. The straight line at the end doesn’t indicate a dogged consistency from early Jake, but rather the end (or rather, the beginning) of Twitter’s retweet mechanisms (manual and native retweets). But before it flatlines, the curve takes a surprisingly long time to amble upwards toward the eventual average that represents the score for all of my tweets. Why? I don’t know for sure, but I think the answer has something to do with the second CSV generated by that same script. It simulates the effect of running the Twee-Q calculation on each separate day of my timeline, looking at the prior 100 tweets starting at that day.


I wrote this script originally because I was curious just how accurate only looking at 100 tweets from an account could be. Twee-Q is admittedly an extreme case, but every analysis of Twitter accounts involves the same tradeoff; they usually examine only a tiny subset of a user’s tweets. We’d like to think that this little sample mirrors the properties of my entire timeline, but it is also pretty variable, as the chart makes clear. Still, could you use this approach on a few random days to average out the variation and get a better estimate? Yes, but that would probably be wrong. All of my meager statistical chops are based on the assumption that any two events are statistically independent, and these scores aren’t independent. If I know the score for one day, I can make a few rough guesses about the score the next day, since that will involve some of the same retweets in the prior day’s score. These scores are less like truly independent samples and more like a moving average. I’m sure there is some cool analysis to be done with this, but that’s beyond my own skill level in statistics.

An alternative is to make each sample independent. Just for kicks, I created a third CSV file, sample.csv, that repeatedly picks a random 100 tweets from my entire timeline and computes the male/female retweet percentage from that selection. Each of these runs is truly independent from the next, which makes things a bit more palatable for statistical analyses. And indeed, if you compare histograms of the spread of the sliding and sampled scores, the latter more closely resembles the normal curve we would expect, while the former is just unbalanced and weird. This affects the resulting calculations too. Contrast the summaries of these two approaches:

Method    Min.     1st Qu.  Median   Mean     3rd Qu.  Max.     Std. Dev.
Sliding   0.2162   0.6111   0.6857   0.6798   0.7500   1.0000   0.1153777
Sampled   0.2941   0.6316   0.7037   0.7002   0.7692   1.0000   0.1012754

The resampled version has less spread and an average that is far closer to the correct population average of 0.6992, which the first CSV converged to. A difference of roughly 0.02 might seem like no big deal, and indeed this is a bit silly, but it seems important too. The sliding method simulates exactly what would happen if I ran Twee-Q’s query against my timeline on successive days, and it yields a worse result than true sampling, but it’s the only automatic mechanism that Twitter provides for applications.
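The independent-sampling approach is easy to sketch. The following is an illustration under stated assumptions, not the script that produced the table: the timeline is synthetic (700 male retweets, 300 female retweets, and 1000 other tweets, shuffled), the number of runs is arbitrary, and the standard deviation uses the n - 1 denominator.

```ruby
# Fraction of gendered retweets in `tweets` that are male, or nil if none.
def male_share(tweets)
  male   = tweets.count(:male)
  female = tweets.count(:female)
  return nil if (male + female).zero?

  male.to_f / (male + female)
end

def mean(values)
  values.sum / values.size.to_f
end

# Sample standard deviation (n - 1 denominator).
def std_dev(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.size - 1))
end

rng = Random.new(42) # fixed seed so repeated runs give the same numbers

# Synthetic timeline with a true male share of 0.7 among gendered retweets.
timeline = ([:male] * 700 + [:female] * 300 + [nil] * 1000).shuffle(random: rng)

# Each run draws 100 tweets at random, independently of every other run.
scores = Array.new(200) { male_share(timeline.sample(100, random: rng)) }.compact

puts "mean #{mean(scores).round(4)}, std dev #{std_dev(scores).round(4)}"
```

Because each draw ignores chronology, no two scores share tweets by construction, which is exactly what the sliding window cannot offer.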

So What?

How much do you retweet men vs. women? What percentage of your followers are bots? Did Mitt Romney buy Twitter followers? What kind of personality do you have? These are all questions that have been asked about Twitter by recent tools and stories. And all of them have inferred their answers by looking at only a limited selection of a user’s Twitter information, whether it’s the 1000 most recent tweets, or the 5000 most recent followers, or the like. Could we do better?

In an ideal world, the Twitter API would provide sampling equivalents to the user_timeline, favorites/list, followers/list, search/tweets and any other endpoints whose API limits are quickly exhausted for any notable accounts. There already is a sampling endpoint for the streaming API, although some have questioned whether that sampling is truly unbiased. To easily and accurately answer questions like those above and others, sampling versions of these other endpoints would be better, particularly when accounts of interest have scaled so far beyond what the API restrictions allow to be investigated in a reasonable time period.

Without that, what are the best practices for deriving data from Twitter? Can we fake sampling the data with the tools we have? Is looking at 100 tweets good enough? Would picking 400, for instance, be much better? Statistics suggests that’s only twice as good, but would the increased API usage and the quality of the data mean even less improvement than that? Are there tricks we can try with the search API to get past the hard limits of the user_timeline method? What are some other pitfalls we should be wary of when using social media data? These are good questions best tackled by someone much more talented at statistics than most journalists playing around with Twitter (including me). Luckily, we are not alone. Twitter has overwhelmingly become the choice for researchers investigating social media, and with that has come some honest acknowledgment of big problems and possible approaches. Maybe a team of social researchers, programmers, and journalists could figure out the best ways to answer the same questions we find ourselves asking about Twitter.
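The “only twice as good” remark comes from the way the standard error of a proportion shrinks with the square root of the sample size. A small sketch of that relationship follows; it assumes independent draws in which every sampled tweet is a gendered retweet, which real timelines will not satisfy, so treat the numbers as a best case.

```ruby
# Standard error of a proportion `share` estimated from `n` independent draws.
def standard_error(share, n)
  Math.sqrt(share * (1 - share) / n)
end

share  = 0.6992 # the all-time ratio reported earlier in this piece
se_100 = standard_error(share, 100)
se_400 = standard_error(share, 400)

puts "n = 100: +/- #{se_100.round(4)}"
puts "n = 400: +/- #{se_400.round(4)}"
puts "improvement: #{(se_100 / se_400).round(2)}x" # four times the data, twice the precision
```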

Finally, I think it’s important to remember that Twitter users are (mostly) human beings who have express reasons for preserving their privacy. I originally wanted to share my corrected list of genders for the Twitter users I’ve retweeted under the ethos of “showing my work.” But doing that would mean sharing a machine-readable file that bypasses some of the obfuscation that Twitter users have chosen precisely to avoid being easily analyzed by spambots and other programs. One can imagine other situations involving Twitter where sharing the data might mean inadvertently becoming part of the problem. Simply put, are there ways we can be transparent about our work while also respecting the privacy of the users we are researching? What about our tools? How responsible are we if someone uses our Twitter widget to report “facts” about other Twitter accounts derived from opaque methodologies? Again, I hope there is some insight from other social research fields on how to balance the need to protect user privacy with being upfront about our own needs.



