Mining Social Media: Finding Stories in Internet Data
An excerpt from Lam Thuy Vo’s new book for journalists on how to analyze social media data.
Today we’re featuring an excerpt from Mining Social Media: Finding Stories in Internet Data by Lam Thuy Vo, which is being released this week and is available for purchase. This content was originally published in slightly different form by No Starch Press. —Ed.
We experience the social web in brief moments that flash by, often without ever coming back to them. Liking a photo on Instagram, sharing a post that someone published on Facebook, or messaging a friend on WhatsApp—whatever the specific interaction, we do it once and likely don’t think about it after.
But from swipes to clicks to status updates, our online lives are being captured by social media companies and used to fill some of the largest data servers in the world. We are producing more data than ever before. By looking at these data points as a whole, we can gain tremendous insight into human behavior. We can also investigate the harm done by these systems, from detecting false online actors (for example, automated bot accounts or fake profiles that seed misinformation) to understanding how algorithms surface questionable content to viewers over time.
If we look at these data points collectively, we can find patterns, trends, or anomalies and, hopefully, better understand the ways in which we consume and shape the human experience online. This book aims to help those who want to go from simply observing the social web one post or tweet at a time to understanding it on a larger, more meaningful scale.
What Is Data Analysis?
The main goal for any data analyst is to gain useful insights from large quantities of information. We can think of data analysis as a way to interview a vast number of records: we may ask about unusual single events, or we may be looking into long-term trends. Interviewing a data set can be a lengthy process with various twists and turns: it might take a few different approaches to find the answers to our questions, the same way it might take a few different meetings to get a good sense of an interviewee.
Even if our questions are simple and focused, getting to the answers can still require us to make several logical and philosophical decisions. What data set may be useful to examine our own behavior, and how would we get that data? If we wanted to determine the popularity of a Facebook post, would we measure that in number of reactions (likes, hahas, wows, and so forth), the number of comments it received, or a combination of both metrics? If we wanted to better understand how people discuss a specific topic on Twitter, what would be the best way to categorize tweets about it?
So while analyzing data takes a certain amount of technical know-how, it’s also a creative process that requires us to use our judgment in an intentional and informed way. In other words, data analysis is both science and art.
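To make the popularity question above concrete, here is a minimal Python sketch, not code from the book. The sample posts, the field names, and the `comment_weight` parameter are all assumptions made up for illustration. It shows how the choice of metric alone can change which post counts as the most popular.

```python
# Hypothetical sample posts. The field names are assumptions for illustration,
# not the shape of any real Facebook export or API response.
posts = [
    {"id": "a", "reactions": 120, "comments": 4},
    {"id": "b", "reactions": 45, "comments": 60},
    {"id": "c", "reactions": 80, "comments": 30},
]


def popularity(post, comment_weight=1.0):
    """Score a post as its reactions plus its weighted comments."""
    return post["reactions"] + comment_weight * post["comments"]


def rank(posts, key):
    """Return post ids ordered from most to least popular under `key`."""
    return [p["id"] for p in sorted(posts, key=key, reverse=True)]


# Three defensible definitions of "popular" applied to the same data.
by_reactions = rank(posts, key=lambda p: p["reactions"])
by_comments = rank(posts, key=lambda p: p["comments"])
by_combined = rank(posts, key=lambda p: popularity(p, comment_weight=1.5))

print(by_reactions)
print(by_comments)
print(by_combined)
```

Each ranking puts the posts in a different order, which is the point: the weight given to comments is a judgment call the analyst has to make and explain, not something the data decides.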
Who Is This Book For?
This book is written for people who have little to no previous programming experience. Given the huge role of social media, the internet, and technology in all of our lives, this book aims to explore them in an accessible and straightforward way. Through practical exercises, you’ll learn the foundational concepts of programming, data analysis, and the social web.
On some level, this book is targeted to someone just like my former self—a person who was fiercely curious about the world but also intimidated by jargon-filled forums, conferences, and online tutorials. We’ll take a macro and micro approach, looking at the ecosystem of the social web as well as the minutiae of writing code.
Coding is more than just a way to build a bot or an app: it’s a way to satisfy your curiosity in a world that is increasingly dependent on technology.
What This Book Covers
The chapters of this book are structured to follow the journey of a data sleuth. We’ll begin by covering how and where to find data from the social web. After all, we need data before we can go about analyzing it! Then, in the later chapters, you’ll learn about the tools necessary to process, explore, and analyze the data we’ve mined.
Part I: Data Mining
Chapter 1: The Programming Languages You’ll Need to Know
Chapter 2: Where to Get Your Data
Explains what APIs are and what kind of data you can access through them, and walks you through accessing data in JSON format. This chapter also covers the process of formulating a research question for data analysis.
Chapter 3: Getting Data with Code
Shows you how to gather the data returned from the YouTube API and use Python to restructure it from JSON to a spreadsheet, specifically a .csv file.
Chapter 4: Scraping Your Own Facebook Data
Defines scraping and describes how to inspect HTML to structure content from web pages into data. It also covers the data archives that social media companies provide so that users can download their own data, and shows you how to extract that data into .csv files.
Chapter 5: Scraping a Live Site
Explains the ethical considerations of scraping websites and walks you through the process of writing a scraper for a Wikipedia page.
Part II: Data Analysis
Chapter 6: Introduction to Data Analysis
Covers the various processes involved in data analysis and introduces Google Sheets by analyzing data from an automated account, or bot.
Chapter 7: Visualizing Your Data
Explores how visualization tools, such as charts in Google Sheets and conditional formatting that highlights variations in the data, can help us better understand our data.
Chapter 8: Advanced Tools for Data Analysis
Transfers concepts you learned from analyzing data in Google Sheets into the realm of programmatic analysis. You’ll see how to set up virtual environments in Python 3, navigate Jupyter Notebooks (a web application that is capable of reading and running Python code), and use the Python library pandas. You’ll also explore the structure and breadth of your data sets.
Chapter 9: Finding Trends in Reddit Data
Builds on the previous chapter to show you how to modify data, filter data, and run basic aggregation using functions in pandas.
Chapter 10: Measuring the Twitter Activity of Political Actors
Explains how to format data as timestamps, modify it more efficiently with lambda functions, and resample it temporally in pandas.
Chapter 11: Where to Go from Here
Lists resources for becoming a better Python coder, learning more about statistical analyses, and analyzing text using natural language processing and machine learning.
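As a taste of the restructuring step described for Chapter 3 above, the following is a minimal sketch of turning nested JSON into rows for a .csv file using only Python's standard library. It is not the book's code. The JSON sample, its field names, and the `flatten` helper are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io
import json

# Hypothetical JSON, loosely shaped like a nested API response.
# These field names are assumptions, not a documented response format.
raw = """
{"items": [
  {"id": "v1", "snippet": {"title": "First video"},
   "statistics": {"viewCount": "100"}},
  {"id": "v2", "snippet": {"title": "Second, with comma"},
   "statistics": {"viewCount": "250"}}
]}
"""


def flatten(item):
    """Pull the nested fields we care about into one flat row."""
    return {
        "id": item["id"],
        "title": item["snippet"]["title"],
        "views": int(item["statistics"]["viewCount"]),
    }


rows = [flatten(item) for item in json.loads(raw)["items"]]

# io.StringIO stands in for open("videos.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "title", "views"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` module quotes the title that contains a comma, so the row still splits into three columns when a spreadsheet program opens the file.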
Empowering Journalists to Leverage Data
So many of our interactions and our behavior are now captured on social media platforms. While companies like Facebook or Twitter have certainly found ways to leverage this data in aggregate, I firmly believe that researchers and users themselves should be enabled and empowered to glean their own insights from some of these vast data sets. This book offers a beginner-friendly introduction to this kind of data analysis.
I’ve been an instructor for more than a decade and truly love seeing students and peers succeed. While the scope of this book is limited, I hope that it sparks enough curiosity in beginners to compel them to continue learning. To that end, feel free to explore my teaching materials, most of which are available on my website.
Next on Source: A Q&A with Lam Thuy Vo about the book and data analysis for social media.