Mining Social Media: A Q&A with Lam Thuy Vo
Her new book, how you can get started using social media data, and why analyzing this data is so important right now
Today we’re featuring a Q&A with Lam Thuy Vo, author of Mining Social Media: Finding Stories in Internet Data. It was released in late 2019 by No Starch Press and is available for purchase or free download. Earlier, we featured an excerpt of this book on Source.—Ed.
Q: Can you tell us how you got started working with social media data?
I got into social media data in 2013 when I was going through a divorce, actually. At that same time, I found out that the average office worker was leaving behind a data footprint of five gigabits, every single day. That was in 2013, according to an MIT report. If I had gone through your computer at that point in time using digital forensics tools, and also mined data from social media and the countless other technology-based platforms, to hoover up all of the information that you left behind by just being a person, I would have found five gigabits’ worth, every single day.
At that point in time, one of the things that I wanted to look into was how I was recovering emotionally as seen through the lens of my social media data and the data that is collected on my devices—my email, on any kind of digital and technology-related device. I started realizing just how potent that data was because I started to really find traces of my behavior, but also of certain feelings of mine over time. That’s when I realized that social media data, in particular, is like longitudinal data of our behavior over time. It is imperfect. It has its strengths and a lot of weaknesses too. But if you find a way to interpret it with other folks who have actually produced the data, or if you can find a way of interpreting it within the universe it’s representing online, you can find really interesting and marvelous insights into human behavior.
I think in many ways I came to it with a very human-centric focus… Data is oftentimes seen as this very neutral, oftentimes very macro understanding of something. I was always drawn to human stories, even though I understood that the intellectual and macro understanding of something is very important as well. But social media was this wonderful way for me to marry these two passions of understanding how humans behave and how I can use and leverage data on a bigger scale.
It started out as this very personal, human-centric way of understanding stories and storytelling. Then I started recognizing that it went way beyond that.
How did I get started writing this book? I started my fellowship at BuzzFeed in October 2016, the month before the presidential elections kind of flew into every journalist’s face because you didn’t know it was going to happen, or at least a lot of journalists didn’t know it was going to happen. It showed how segregated our information universes were. I started seeing that there was this incredible force for understanding people, important to understanding where our society is headed as we are increasingly consuming information very much within the confines of our filter bubbles. I realized that [social media] became a sociological tool, an ethnological tool, and a tool for accountability because every emotional footprint that is left online is also a footprint that can be used by bad actors or people who are trying to manipulate.
Really, it came down to more and more people in different fields coming up to me and asking “Can you teach me this?” and “Can you show me how to do this?” I think the most critical moment was when I was doing a workshop for the Berkeley University Law Center for Human Rights. They were looking at human rights violations as they were documented online. That’s when I realized all these lawyers who had zero background in coding are doing incredible work manually. They could be so much more effective if they had someone who would not just work with them but who would empower them to do this kind of work, because understanding social media and understanding data, and then using and leveraging code to harvest this data and analyze this data, go hand-in-hand. You can’t look for the right data if you don’t know what the question is that you want to answer with it, right?
In many ways these field experts (people who know how to go after human rights violators, people who are at the forefront of civil rights, advocates, people who are looking to hold the powerful accountable) oftentimes don’t have the technical skills to be as effective as they want to be. But they have an incredible expertise that journalists don’t.
My goal with this book—and I wrote it with the understanding that my publisher would allow me to publish it for free online as well—was to make this as accessible as possible. I don’t like the idea of a few nerds hogging this skill and hogging one gift to have attention. Especially when these tools are increasingly becoming free and accessible online.
Then I was doing a workshop on social media and data mining in Kenya and people were like, “Oh my goodness, this could help us so much to hold our government accountable.” So it really is people who are on the ground who are doing the good work already who just need a scraper, a little robot that does repetitive tasks for them on a much larger scale, that inspired me to do this book.
Q: Have you heard any potential readers see the topic of your book and say, “What about privacy? We’re all so concerned with giving away social media data, and yet you are advocating for journalists to go digging around through it.”
There are different tiers of social media data that are out there, and I think that one of the things I’m trying to make with the book is an argument for being very careful about how you amplify and what kind of questions you ask of it. There have been various ways in which social media access has narrowed a lot based on new privacy regulations that came out of Cambridge Analytica… In the chapter I called “The Quantified Selfie”—data archives that you can download about your own data—I really encourage readers to look at it from a standpoint of you need to be responsible with how you interpret the data and what you try to do and know the implications of what it means to amplify it beyond an intended audience.
That means that certain types of stories should be done in tandem with the source. If I do a “quantified selfie” with someone else’s data, it means that I have to be responsible for protecting them, and for understanding how to interpret the data through the lens of that person. There are a few sections where I’m trying to teach people how to be ethical about certain things, from being an ethical and polite scraper all the way through understanding how to responsibly interpret data. I hope that that does something to bring about a responsible way of doing research when the rest of the world is doing no-inhibitions, commercialized data mining.
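To make “polite scraper” a little more concrete, here is a minimal sketch in Python. It is not taken from the book. It assumes that politeness means at least two things: skipping pages a site's robots.txt disallows, and waiting between requests. The robots.txt content, the `example-research-bot` user agent, the example.com URLs, and the helper names are all hypothetical.

```python
from urllib import robotparser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set we can query."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def plan_requests(rules, urls, user_agent="example-research-bot", default_delay=2.0):
    """Return (url, seconds_to_wait) pairs for the URLs we are allowed to fetch.

    Uses the site's Crawl-delay if it declares one, otherwise falls back to
    default_delay (an assumed value, not a standard).
    """
    delay = rules.crawl_delay(user_agent)
    if delay is None:
        delay = default_delay
    return [(url, delay) for url in urls if rules.can_fetch(user_agent, url)]


rules = load_rules(SAMPLE_ROBOTS_TXT)
plan = plan_requests(
    rules,
    ["https://example.com/posts/1", "https://example.com/private/profile"],
)
```

A real scraper would then request each URL in `plan` and sleep for the listed number of seconds between requests. Respecting robots.txt is only a baseline; it does not settle the questions of consent and amplification discussed above.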
One of the things I hope comes out of this is that a) people will become more aware of what data is out there; and, b) that it becomes a counterbalance to the ways in which this data is leveraged behind the scenes by big companies.
Q: Can you walk us through a piece that you’ve done, or an example that helps us understand some interesting ways to apply this?
One thing that this storytelling can do is teach people how to understand the internet and how to understand the rules that govern how we consume information now.
It’s been a really weird time in the past three years to be on this beat because in many ways I am writing about how people consume my content and other people’s content. It’s very meta…. We did a profile, two quantified selfies, of a conservative mom and a liberal daughter where we gained access to the basic news feeds of the conservative mother and the liberal daughter. Two people that love each other. We wanted two people who love each other and want to understand each other and then we wanted to show how the digital information segregation has really changed the ways in which they interact, at least online.
We were able to look at every single post and then started looking at the clusterings: which person shows up the most on the news feed of the conservative mom and which person shows up the most on the liberal daughter’s Facebook news feed? On the mom’s side, I think it was like an old friend who she’d known for decades. For the daughter, who had moved away and become an advocate for social justice, the top person to appear on her news feed was the regional director of the regional ACLU.
It really was this interesting thing where we were able to show how, at least in the case of these two people, specific choices that these two folks made—the mom staying in one place and the daughter moving away and getting involved in a specific field—these choices suddenly warped entire worlds that they inhabited when it came to information. I know that this is not statistically comprehensive, right? But it is a really helpful way of understanding through an example how something like information segregation or social bubbles can really affect these relationships between two people. That’s just one of the ways you can use that data to really drill in on something.
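The “who shows up the most” question can be sketched as a simple frequency count. This is an illustrative Python sketch, not the code used for the story; the post structure (a list of dictionaries with an `author` key) and the sample names are made up.

```python
from collections import Counter


def top_posters(posts, n=3):
    """Return the n accounts that appear most often in a list of feed posts."""
    counts = Counter(post["author"] for post in posts if post.get("author"))
    return counts.most_common(n)


# Hypothetical feed data: one dictionary per post seen in a news feed.
sample_feed = [
    {"author": "Old friend", "text": "..."},
    {"author": "Local news page", "text": "..."},
    {"author": "Old friend", "text": "..."},
    {"author": "Cousin", "text": "..."},
    {"author": "Old friend", "text": "..."},
    {"author": "Local news page", "text": "..."},
]

ranking = top_posters(sample_feed, n=2)
```

Running the same count on two people's feeds and comparing the top names is a rough way to see how differently two information worlds are populated. As noted above, a comparison like this is an example, not a statistically comprehensive finding.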
There are more whimsical approaches, too: I made a Christmas present for my niece to show how the text messages my family sent to each other increased [after she was born] and how she became a sort of glue for us. It’s more of an artistic piece. I made her a little lullaby out of the frequency of the texting that we did. To some degree when we become adults we drift apart as we become human beings, and then when there’s a grandchild involved everyone jumps back into this whole investment of seeing someone grow. That was an interesting way of doing a story.
There was another story that is a bit more serious, where we looked into harvesting examples of hate speech from people who had been elected into the Myanmar government and then found out that one in ten of the posts that they put on their timelines included hate speech. So there are different ways of applying this data when we start understanding that the information that people put out is a trace of their specific behavior.
Q: It’s interesting to see how this, more than other forms of journalism, especially data journalism, is so inherently personal. I feel like data journalism tends to look at things in the abstract and this is people’s lives just sort of laid bare. Why is it so important to do this [social media-driven data journalism] now?
Journalism is really losing ground on a local level and on a national level because of how distribution systems have changed. The way we find a story has a lot less to do with serendipity and a lot more to do with algorithms that tailor results to you. We’ve gone from a newspaper, where you can really open up the pages and stumble upon an article in the corner about something you never expected to see, to an internet-based level of consumption where there’s less and less editorial input from humans and a lot more of this effect where an algorithm collects your most extreme reactions to something. Because an article that’s complex, one that makes us think and makes us walk around in the world differently, doesn’t actually solicit any “hahas,” “wows,” likes, hearts, or “angries.”
My friend and I often refer to this as the tyranny of the loudest. The social web only measures humanity in its extreme emotions which are solicited through content that makes us have reactions rather than thoughtful periods of time where we change and think. Because the internet and algorithms are fueled by that data, we are also more likely to only get that kind of information back. That’s how an existing small difference then becomes a much larger difference.
Q: That feels like the heart of everything that’s going on right now.
Yeah! And it’s really sad, because I think in many ways serendipity, taking your time, all of these things that encourage critical thinking, have been cut in half by the dopamine that we get from seeing a “like,” from pushing a button, from sharing something that makes us outraged or sharing something that makes us feel elation. So I think what’s so important about this kind of field is that we need to reeducate people in the art of understanding what they are looking at rather than having facts be completely enveloped in emotions at all times. On top of that, one of the things I hope this book does too is to really show people that they need to not equate the internet with the world. One of the biggest fallacies in journalism and in how we interact with social media data is that a lot of people start out equating Twitter with the general population. That is not right. And that goes back to the tyranny of the loudest.
I hope and I believe, having seen some of the data and having run a few smaller stories, that the majority of the people are actually more passive consumers and that there’s a small minority that actually hijacks attention.
Whether it’s malicious or whether it’s just loud, angry people or loud, happy people — that minority that’s really loud gets to dictate what people think is the entirety of society. It goes both ways. Not only does it fuel what comes back onto your timeline, it also is the wrong reflection of what people are.
The large contingent that’s much more moderate about a lot of things is never heard because that is not how the social web is engineered. When you do stories about using and leveraging social media data, you need to understand what the universe is that you are in. What the rituals and language are of that universe. And you need to treat it like a beat in itself. Going into 4chan or 8chan is very different than understanding, let’s say, the Greta Thunberg movement. They all come with their own particular set of people, their own actors and mayors and citizens of that particular universe; laws within that universe; rituals and rites and things that are particular to that in-group, and how they define the out-group. Understand that the internet does not represent the world, at least when we are trying to measure it through the content it produces. And the internet in itself is kind of like a conglomeration of different, small features that sometimes overlap. That’s something that we touch upon in the book, and that I hope people get out of it.
Q: Have you thought about a follow-up?
Oh, no no. I don’t want to do another coding book ’cause that’s, like, excruciating. For like two years, I didn’t have a weekend. But I always saw my North Star. It was my students and the kinds of people that I meet at Code Book Kenya, in Nairobi, or the law students who really want to do this stuff. Or the 18-year-old kid I meet during a workshop where I talk about social media data as a way that a lot of people incriminate black and brown youth. For me, that is my target audience and that is who I am trying to serve with this book.
If I think about a next book, I would really love to write a series, like an episodic understanding of how social media has distorted a lot of the ways in which we see society, through quantified selfies of individuals.
Also: Last year I ran an experiment. I basically labeled emails or any text messages, Slack messages, Twitter DMs, everything that I got that was related to someone who barely knew me or had known me and not talked to me in years. I found that I had more than 200 emails last year from random people asking me for help. One of the main reasons why I wrote this book is so that I can pass on that link. It’s like writing reproducible data analysis, but writing something that makes my helping reproducible.