Simbot, Give Me Five
A Slackbot uses a neural network to find related stories for journalists
At Vox Media, data science and data engineering are working together to build products with editors’ and journalists’ needs in mind. One such experimental product is a tool that enables editors to discover relevant content on demand.
Given Vox Media’s history of building successful Slackbots, and the broad adoption of Slack among the editing staff, we decided to implement the tool’s interface through a Slackbot, which we have named simbot. The neat thing about this implementation is that it could be built quickly, without requiring a specialized user interface or the maintenance overhead of a larger system. On the editorial side, the benefit is that users can make queries instantly, without having to use a separate interface or a unique login to access the results.
What Our Users Needed
This project began as a data product idea for discovering and understanding the relationships between different pieces of content that Vox Media has published. Before developing the actual product, we had conversations with a few different editors to understand their needs when interacting with existing content. One common theme that emerged was the need for a tool that finds similar articles, especially older articles that people have forgotten about or articles written by others. Prior to the Slackbot solution, there were three main ways for editors to access relevant content:
- using keyword searches through search engines,
- using keyword searches through tools based on Google Analytics that allowed them to discover popular content, and
- through manually curated lists of “evergreen” content.

In contrast, simbot fetches the full text of an article, analyzes it, and returns similar articles based on this richer search context.
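As a rough illustration of that first step, here is a minimal fetch-and-clean sketch using only the Python standard library. This is not simbot's actual ingestion code (the real pipeline reads articles from Vox Media's own systems), and the function names are hypothetical:

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def clean_html(html):
    """Strip markup from an HTML document, keeping only the text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


def fetch_article_text(url):
    """Hypothetical helper: download a page and return its plain text."""
    with urlopen(url) as resp:
        return clean_html(resp.read().decode("utf-8", errors="replace"))
```

The cleaned text is what gets tokenized and fed to the similarity model.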
Currently, there are two main applications for the bot. The first one is to enable editors who post Vox Media content on social media, such as Snapchat or Twitter, to discover related Vox Media content and build a storyline through the discovered results. The second one is to enable journalists who are writing new articles to find related content that they can link to.
How It Works
The basis of our algorithm for finding similar articles is a neural network that takes the words of each article and projects them into vectors of numbers. We then aggregate the word vectors for all of the words in an article to produce an article vector. These vectors make it easy to uncover relationships between words and articles by applying similarity measures, such as cosine similarity. Specifically, the neural network algorithm is word2vec, implemented through the Python topic-modeling library gensim. (We tried other algorithms as well, but editors’ feedback on their results was not as positive.)
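The aggregate-and-compare step can be sketched in a few lines of plain Python. The toy three-dimensional word vectors below stand in for embeddings that, in production, would come from a trained word2vec model (e.g. gensim's `Word2Vec`); the averaging strategy shown is one simple way to build an article vector and is not necessarily identical to simbot's:

```python
import math

# Toy word vectors; in production these come from a word2vec model
# trained with gensim. Real vectors have hundreds of dimensions.
WORD_VECTORS = {
    "election": [0.9, 0.1, 0.0],
    "senate":   [0.8, 0.2, 0.1],
    "recipe":   [0.0, 0.9, 0.3],
    "kitchen":  [0.1, 0.8, 0.4],
}


def article_vector(words):
    """Average the vectors of the words we have embeddings for."""
    vecs = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


politics = article_vector(["election", "senate"])
cooking = article_vector(["recipe", "kitchen"])
```

Articles about the same topic end up with nearby vectors, so `cosine_similarity(politics, politics)` is far higher than `cosine_similarity(politics, cooking)`.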
When we first started working on this tool, we had a simple Python script to clean up all published articles in our database and train a word2vec model. We stored similarity values for all pairs of articles in Redis. In early iterations of the bot, we were running our own Redis server, but we eventually switched to using AWS-managed Elasticache.
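One natural Redis layout for this is a sorted set per article, written with redis-py's `zadd` and read back with `zrevrange`. Since a live server isn't needed to show the idea, the sketch below uses a plain dict as a stand-in for Redis; the key schema (`sim:<article_id>`) is our illustrative choice, not necessarily simbot's:

```python
# In production these writes go to Redis (AWS ElastiCache) via redis-py:
#   r.zadd(f"sim:{article_id}", {other_id: score})
#   r.zrevrange(f"sim:{article_id}", 0, k - 1, withscores=True)
# Here a dict stands in for the Redis server so the logic is runnable.

store = {}  # key "sim:<article_id>" -> {other_article_id: similarity}


def save_similarity(article_id, other_id, score):
    """Record a similarity score in both directions (it is symmetric)."""
    store.setdefault(f"sim:{article_id}", {})[other_id] = score
    store.setdefault(f"sim:{other_id}", {})[article_id] = score


def top_similar(article_id, k):
    """Return the k most similar articles, highest score first."""
    scores = store.get(f"sim:{article_id}", {})
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


save_similarity("a1", "a2", 0.91)
save_similarity("a1", "a3", 0.42)
```

Storing precomputed scores keyed by article means a lookup at query time is a single ranked read rather than an on-the-fly model evaluation.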
After having this script in place, we began thinking about regular updates for new articles. Our first iteration scheduled a regular cron job that reprocessed all articles and updated the model. However, this meant the bot might not have results for the latest articles, so we eventually moved to an event-based solution. Every time a new article is published, we receive an event on a Kafka queue, which kicks off a process that updates the model and similarity values.
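The core of that event-driven update is a handler that parses each publish event and triggers a refresh. The event schema below (a JSON payload with `type` and `article_id` fields) is a guess for illustration; the real message format on Vox Media's Kafka topic may differ:

```python
import json


def handle_publish_event(raw_message, refresh_model):
    """Parse a published-article event and trigger a model update.

    `refresh_model` is whatever callable recomputes the word2vec model
    and similarity values for the new article. Returns the article id
    that was processed, or None if the event was not a publish event.
    """
    event = json.loads(raw_message)
    if event.get("type") == "article_published":
        refresh_model(event["article_id"])
        return event["article_id"]
    return None


# In production, a consumer loop (e.g. kafka-python's KafkaConsumer
# subscribed to the publish topic) would call handle_publish_event
# on each message as it arrives.
```

Keeping the handler separate from the consumer loop makes it easy to unit test without a running Kafka broker.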
We also have a simple REST web service created using Flask that outputs related articles ranked based on their latest stored similarity values. The Slackbot queries this web service and adds a dash of formatting into the mix before outputting to Slack.
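A minimal version of that Flask service might look like the following. The route name, query parameters, and in-memory stand-in for the Redis-backed scores are all assumptions made for illustration:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for similarity values stored in Redis; the real service
# reads these from ElastiCache. The URLs here are made up.
SIMILARITIES = {
    "https://www.vox.com/example-article": [
        {"url": "https://www.vox.com/related-1", "score": 0.91},
        {"url": "https://www.vox.com/related-2", "score": 0.42},
    ],
}


@app.route("/similar")
def similar():
    """Return the top-N stored matches for a seed article URL as JSON."""
    url = request.args.get("url", "")
    count = int(request.args.get("count", 5))
    results = SIMILARITIES.get(url, [])[:count]
    return jsonify({"seed": url, "results": results})
```

The Slackbot then only needs to issue an HTTP GET against this endpoint and format the returned JSON into a Slack message.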
What It Looks Like
When a Vox Media Slack user sends a direct message to simbot specifying a seed article URL and the number of desired results, the bot returns a ranked list of the articles most similar in content to the article at that URL. This is what a typical Slack interaction looks like:
At the end of each set of results, we give users the opportunity to submit feedback, and we are continually improving the tool based on it. The initial feedback from editors has been positive and very helpful. Based on their suggestions, some of the items we plan to address in future versions of the bot are weighting title words more heavily than article-body words and accepting seed articles from outside Vox Media.
For those interested in developing a similar tool for their organization, our advice is to involve users in the design process as much as possible and to let the project evolve with their needs. And for those interested in learning more about the inner workings of simbot, we plan to open-source our code in the next few months.
Yian Shang is a Data Infrastructure Engineer at Vox Media. She works on building systems used to process data across Vox Media platforms and enjoys making sense of (big) data. More about her can be found at http://testofti.me/.
Elena Zheleva is a principal data scientist at Vox Media where she develops machine learning solutions to problems arising in modern media and digital journalism. Prior to Vox Media, she was the head of data science at LivingSocial. More on her interests can be found at www.umiacs.umd.edu/~elena.