This is a piece we wrote for the third annual Source Botweek, describing a tool we built in the last couple of months:
At Vox Media, data science and data engineering are working together to build products with editors’ and journalists’ needs in mind. One such experimental product is a tool that enables editors to discover relevant content on demand.
Given Vox Media’s history of building successful Slack bots, and the broad adoption of Slack among the editorial staff, we decided to implement the tool’s interface through a Slack bot, which we have named simbot. The neat thing about this implementation is that it can be built quickly, without requiring a specialized user interface or maintenance through a larger system. Additionally, the benefit on the editorial side is that users can make queries instantly, without having to use a separate interface or a unique login to access the results.
When a Vox Media Slack user sends a direct message to simbot specifying a seed article URL and a number of desired results, the bot returns a ranked list of the articles that are most similar in content to the article at the provided URL. This is what a typical Slack interaction looks like:
This project began as a data product idea for discovering and understanding the relationships between different pieces of content that Vox Media has published. Before developing the actual product, we had conversations with a few different editors to understand their needs when it comes to interacting with existing content. One of the common themes that emerged was the need to have a tool that allows finding similar articles, especially older articles that people have forgotten about or articles written by others.
Prior to the Slack bot solution, there were three main ways for editors to access relevant content: 1) using keyword searches through search engines, 2) using keyword searches through tools based on Google Analytics that allowed them to discover popular content, and 3) through manually curated lists of ‘evergreen’ content. In contrast, the simbot solution is able to fetch the full text of an article, analyze it and return similar articles based on this richer search context.
Currently, there are two main applications for the bot. The first one is to enable editors who post Vox Media content on social media, such as Snapchat or Twitter, to discover related Vox Media content and build a storyline through the discovered results. The second one is to enable journalists who are writing new articles to find related content that they can link to.
The algorithm for finding similar articles is based on a neural network that takes the words of each article and projects them into vectors of numbers. We then aggregate the word vectors for all of the words in an article to produce an article vector. These vectors make it easy to uncover relationships between words and articles by applying similarity measures such as cosine similarity. Specifically, the neural network algorithm is word2vec, as implemented in the Python library gensim. We tried other algorithms as well, but the feedback from editors on their results was not as positive.
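To make the idea concrete, here is a minimal sketch of turning word vectors into article vectors and comparing them with cosine similarity. It is an illustration under assumptions, not our production code: the tiny three-dimensional `WORD_VECTORS` table stands in for a trained word2vec model, and averaging is only one possible way to aggregate word vectors.

```python
import math

# Toy 3-dimensional word vectors standing in for a trained word2vec model.
# Real models use hundreds of dimensions; these values are made up.
WORD_VECTORS = {
    "senate": [0.9, 0.1, 0.0],
    "vote": [0.8, 0.2, 0.1],
    "election": [0.7, 0.3, 0.0],
    "playoff": [0.0, 0.9, 0.2],
    "quarterback": [0.1, 0.8, 0.3],
}


def article_vector(words, word_vectors):
    """Average the vectors of all known words (assumed aggregation method).

    Returns None when no word in the article has a vector.
    """
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


politics_a = article_vector(["senate", "vote"], WORD_VECTORS)
politics_b = article_vector(["election", "vote"], WORD_VECTORS)
sports = article_vector(["playoff", "quarterback"], WORD_VECTORS)
```

With these toy values, the two politics articles score much closer to each other than either does to the sports article, which is the property the ranking relies on.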
When we first started working on this tool, we had a simple Python script that cleaned up all published articles in our database and trained a word2vec model. We stored similarity values for all pairs of articles in Redis. In early iterations of the bot, we ran our own Redis server, but we eventually switched to AWS-managed ElastiCache.
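One way to lay out pairwise similarity values in Redis is a sorted set per article, with the other article's ID as the member and the similarity as the score. The sketch below assumes that layout and a hypothetical `similar:<article_id>` key scheme; neither is a description of our actual schema. The `client` argument is expected to behave like a redis-py client, whose `zadd` and `zrevrange` methods are used here.

```python
def store_similarities(client, article_id, scores):
    """Store similarity scores for one article in a sorted set.

    `scores` maps other article IDs to similarity values.
    The key name "similar:<article_id>" is a hypothetical convention.
    """
    if scores:  # zadd rejects an empty mapping
        client.zadd(f"similar:{article_id}", scores)


def most_similar(client, article_id, count):
    """Return up to `count` (other_id, similarity) pairs, highest score first."""
    return client.zrevrange(
        f"similar:{article_id}", 0, count - 1, withscores=True
    )
```

With redis-py, `client` could be created as `redis.Redis(decode_responses=True)` so that members come back as strings rather than bytes.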
Once this script was in place, we began thinking about regular updates for new articles. Our first iteration scheduled a cron job that reprocessed all articles and updated the model. However, this meant that the bot might not have results for the latest articles, so we eventually moved to an event-based solution. Every time a new article is published, we receive an event on a Kafka queue, which kicks off a process that updates the model and similarity values.
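The event-based update can be sketched roughly as follows. This is a simplified illustration: the event is assumed to be a dict that already carries an `article_id` and a precomputed article `vector` (both hypothetical field names), the Kafka consumer and the model update are left out, and an in-memory `SimilarityIndex` stands in for the real store.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SimilarityIndex:
    """In-memory stand-in for stored article vectors and similarity values."""

    def __init__(self):
        self.vectors = {}       # article_id -> article vector
        self.similarities = {}  # article_id -> {other_id: similarity}

    def add_article(self, article_id, vector):
        """Score the new article against every stored one, in both directions."""
        scores = {}
        for other_id, other_vector in self.vectors.items():
            if other_id == article_id:
                continue  # a republished article is not compared with itself
            score = cosine_similarity(vector, other_vector)
            scores[other_id] = score
            self.similarities.setdefault(other_id, {})[article_id] = score
        self.vectors[article_id] = vector
        self.similarities[article_id] = scores


def handle_publish_event(event, index):
    """Process one 'article published' event (hypothetical event shape)."""
    index.add_article(event["article_id"], event["vector"])
```

The point of the design is that each event touches only the new article's row and one entry per existing article, instead of reprocessing every pair as the cron job did.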
We also have a simple REST web service created using Flask that outputs related articles ranked based on their latest stored similarity values. The Slack bot queries this web service and adds a dash of formatting into the mix before outputting to Slack.
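The ranking and formatting steps might look something like the sketch below. The Flask routing and the Slack client are omitted so that only the core logic is shown; the function names, the input shapes, and the exact message layout are assumptions made for illustration. The `<url|title>` form is Slack's own link syntax.

```python
def top_related(similarities, count):
    """Rank stored similarity values and keep the `count` highest.

    `similarities` maps article URLs to similarity scores.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:count]


def format_for_slack(results, titles):
    """Render ranked (url, score) pairs as a numbered list of Slack links.

    `titles` maps URLs to headlines; the URL is used when a title is missing.
    """
    lines = []
    for rank, (url, score) in enumerate(results, start=1):
        title = titles.get(url, url)
        lines.append(f"{rank}. <{url}|{title}> (similarity {score:.2f})")
    return "\n".join(lines)


stored = {
    "https://example.com/a": 0.42,
    "https://example.com/b": 0.91,
    "https://example.com/c": 0.77,
}
message = format_for_slack(
    top_related(stored, 2), {"https://example.com/b": "B story"}
)
```

Keeping ranking in the web service and formatting in the bot means other clients could reuse the same results without inheriting Slack-specific markup.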
At the end of each set of results, we give users the opportunity to submit feedback, and we are continually improving the tool based on that feedback. The initial feedback from editors has been positive and very helpful. Based on their suggestions, two of the items that we plan to address in future versions of the bot are assigning higher weights to title words than to article body words, and accepting seed articles that are external to Vox Media.
For those who are interested in developing a similar tool for their organization, our advice would be to involve their users in the design process as much as possible and to make their project evolve according to the users’ needs. We plan to open-source our code in the next few months.