Faster Analytics through Elasticsearch

This is a story about how we optimized the performance of our video analytics at Vox Media using Elasticsearch. To understand the problem, let me first introduce you to the application we optimized.

Volume

At Vox Media, we use an application named Volume to manage the production and usage of video on our many sites. One aspect of this management is presenting analytics about the top-performing videos and pages containing videos. Through Google Real-Time Analytics, we measure the number of times a video comes into the viewport of a web browser and the number of times a video is played. This allows the video team to promote high-performing videos and discover opportunities for placing video on top pages.

We pack the event fields in Google Analytics with information about the context in which a video occurs, including the type of page and the location of the video on the page. This data must be unpacked before it is presentable to the user, so we pull it into Volume for post-processing and display. Here is an example of the data we present for each video: overall performance, plus breakdowns by the page locations where the video appears and by the individual pages that displayed it.

Example of Analytics Data for a Single Video in Volume

Originally, we used an ActiveRecord / MySQL approach to store the post-processed data. This worked acceptably while we were only processing analytics for The Verge, but as we started to add other sites, we found this approach was overloading our database and our Rails app servers. The initial buildout of the functionality had to be done very quickly to be ready for the Consumer Electronics Show, and it was admittedly un-optimized.

Elasticsearch

When we went back to consider ways to optimize the analytics tools, I became aware of Elasticsearch Aggregations. Aggregations allow you to summarize data at many different levels, all in one query. For example, when we query to get information about top videos, we are summing up views and plays for:

  1. each combination of page and location on the page
  2. all pages that use the video at a particular location on the page
  3. all usages of the video
  4. all analytics for the top 16 videos.
Here is an example of what the query for top videos looks like:

{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "provider": "The Verge"
                    }
                },
                {
                    "term": {
                        "@timestamp": "2015-06-01T15:34:16-04:00"
                    }
                }
            ]
        }
    },
    "aggs": {
        "total_plays_overall": {
            "sum": {
                "field": "plays"
            }
        },
        "total_views_overall": {
            "sum": {
                "field": "views"
            }
        },
        "all_video_titles": {
            "terms": {
                "field": "video_title",
                "size": 16,
                "shard_size": 64,
                "order": {
                    "total_plays_by_video": "desc"
                }
            },
            "aggs": {
                "total_plays_by_video": {
                    "sum": {
                        "field": "plays"
                    }
                },
                "total_views_by_video": {
                    "sum": {
                        "field": "views"
                    }
                },
                "all_locations": {
                    "terms": {
                        "field": "higher_location"
                    },
                    "aggs": {
                        "total_plays_by_location": {
                            "sum": {
                                "field": "plays"
                            }
                        },
                        "total_views_by_location": {
                            "sum": {
                                "field": "views"
                            }
                        },
                        "unique_links_by_location": {
                            "cardinality": {
                                "field": "page_title"
                            }
                        },
                        "all_titles": {
                            "terms": {
                                "field": "page_link",
                                "order": {
                                    "total_plays": "desc"
                                }
                            },
                            "aggs": {
                                "total_plays": {
                                    "sum": {
                                        "field": "plays"
                                    }
                                },
                                "total_views": {
                                    "sum": {
                                        "field": "views"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

You can see how the nested structure allows the calculation of sub-totals at multiple levels all at once.
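To make the shape of the results concrete, here is a minimal sketch in plain Ruby of walking a response from a query like the one above. The response hash here is hypothetical and trimmed (the video title, location keys, and numbers are invented for illustration), but the bucket structure mirrors the aggregation names in the query:

```ruby
require "json"

# A trimmed, hypothetical response shaped like the query above would return.
response = JSON.parse(<<~RESPONSE)
  {
    "aggregations": {
      "total_plays_overall": { "value": 1500 },
      "total_views_overall": { "value": 9000 },
      "all_video_titles": {
        "buckets": [
          {
            "key": "Example Video",
            "total_plays_by_video": { "value": 1500 },
            "total_views_by_video": { "value": 9000 },
            "all_locations": {
              "buckets": [
                {
                  "key": "river",
                  "total_plays_by_location": { "value": 900 },
                  "total_views_by_location": { "value": 5000 }
                },
                {
                  "key": "sidebar",
                  "total_plays_by_location": { "value": 600 },
                  "total_views_by_location": { "value": 4000 }
                }
              ]
            }
          }
        ]
      }
    }
  }
RESPONSE

aggs = response["aggregations"]

# Top level: overall totals across every video.
puts "Overall: #{aggs['total_plays_overall']['value']} plays"

# Each bucket in all_video_titles is one video, carrying its own sub-totals.
aggs["all_video_titles"]["buckets"].each do |video|
  puts "#{video['key']}: #{video['total_plays_by_video']['value']} plays"

  # Each nested bucket is one page location, with its own sub-totals again.
  video["all_locations"]["buckets"].each do |location|
    puts "  #{location['key']}: #{location['total_plays_by_location']['value']} plays"
  end
end
```

Every level of sub-total arrives pre-computed in one response, so the Rails side is reduced to walking a hash rather than issuing and summing many SQL queries.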

The amazing thing is that the Elasticsearch query generally runs in less than 100 milliseconds (measured in my laptop development environment to avoid network overhead). I did some profiling with Ruby Prof to compare the Elasticsearch approach with the ActiveRecord approach. The Elasticsearch approach took:

  • 89% less process time
  • 87% less wall clock time
  • 74% fewer object allocations, greatly reducing the garbage collection that was taxing our app servers
  • 99% fewer save operations (INSERT or PUT) to store the data
  • 99.9% fewer query statements to display the data.

As noted above, the ActiveRecord approach could have been optimized to reduce the number of SQL statements, but it was far simpler to convert it to Elasticsearch. Even an optimized ActiveRecord approach would have required many complex SUM queries, with additional processing in Rails.

This approach also gave us two other advantages. First, it moves the frequently refreshed analytics data and operations off of the MySQL database, freeing it for its main jobs: storing, managing, and displaying information about videos. Second, we plan to build time series graphs of the analytics so that we can see trends in the rise and fall of a video's popularity. Given the vast amount of data required, this was not realistic with MySQL.
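The time series idea should fit the same pattern: the sum aggregations can nest under a date_histogram instead of a terms bucket. A hypothetical sketch (the hourly interval is an assumption, not something we have built yet) might look like:

```json
{
    "aggs": {
        "plays_over_time": {
            "date_histogram": {
                "field": "@timestamp",
                "interval": "hour"
            },
            "aggs": {
                "total_plays": {
                    "sum": {
                        "field": "plays"
                    }
                }
            }
        }
    }
}
```

Each time bucket would carry its own play total, ready to feed a graphing library without any per-point queries.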

What to consider when adopting this approach

The most time-consuming part of the development process was learning the aggregation syntax and then incrementally building up my queries until they did what I wanted. To learn the syntax, I recommend reading the posts on the Elasticsearch blog related to aggregations, starting with this. To build up your query, fire up the Marvel plugin in your development environment and use the Sense tool to test out your queries; Sense provides autocomplete to help you build them.

If you plan to aggregate on fields that contain text (for example, a title), you will want to use the Mapping API to mark those fields as "not_analyzed" so that aggregation is done on the whole title. If you don't, the aggregations will be over the most popular individual words in the title instead.
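A minimal mapping fragment for that looks like the following (the field name here matches the query above; apply it via a PUT to your index's mapping before indexing data):

```json
{
    "properties": {
        "video_title": {
            "type": "string",
            "index": "not_analyzed"
        }
    }
}
```

With this in place, a terms aggregation on video_title buckets by the full title string rather than by its tokenized words.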

To get more accurate results, consider setting shard_size larger than the number of results you wish to display. We use a shard_size that is 4 times the number of results displayed (16 and 64 in the query above).