
Introducing Raigad - An Elasticsearch Sidecar

Netflix has very diverse data needs. Those needs fall anywhere between rock-solid durable datastores, like Apache Cassandra, and lossy in-memory stores, such as the current incarnation of Dynomite. Somewhere in that spectrum is the need to store, index and search documents. This is where Elasticsearch has found a niche at Netflix.
Elasticsearch usage at Netflix has proliferated over the past year. It began as one or two isolated deployments managed by the teams using it. That usage has quickly grown to more than 15 clusters (755 nodes) in production, centrally managed by the Cloud Database Engineering (CDE) team.
CDE, like the rest of Netflix, believes in automating the operations of our production systems. This is what led us to create tools such as Priam, a sidecar to help manage Apache Cassandra clusters. That same philosophy led us to create Raigad, an Elasticsearch sidecar.

Key Features

Integration with a centralized monitoring system

Raigad collects and publishes Elasticsearch metrics to a centralized telemetry, monitoring and alerting system. This is achieved using the Netflix Open Source project Servo. Raigad’s architecture allows you to integrate with your own telemetry system.
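As a rough illustration of what a sidecar metrics loop can look like, here is a minimal Python sketch, not Raigad’s actual implementation: it polls the local Elasticsearch node-stats API and hands a couple of values to a hypothetical publish() hook standing in for Servo (or your own telemetry system).

import time
import requests  # assumes the requests library is available

def publish(name, value):
    # Hypothetical hook: replace with Servo, Atlas, statsd, or your own telemetry client.
    print("metric %s=%s" % (name, value))

def poll_node_stats(es_url="http://localhost:9200"):
    stats = requests.get(es_url + "/_nodes/_local/stats", timeout=5).json()
    node = list(stats["nodes"].values())[0]
    publish("es.docs.count", node["indices"]["docs"]["count"])
    publish("es.heap.used_percent", node["jvm"]["mem"]["heap_used_percent"])

while True:
    poll_node_stats()
    time.sleep(60)  # publish once a minute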

Node discovery and tracking

We’ve included a sample implementation that uses Cassandra for Raigad to keep track of Elasticsearch cluster metadata. Every Elasticsearch instance reads Cassandra to discover the other nodes it needs to connect to during bootstrap. In this sample implementation, Cassandra eases multi-region Elasticsearch deployments by replicating the Elasticsearch metadata across every region where Elasticsearch is deployed. This could also be implemented using Eureka.
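The discovery flow can be pictured with a small Python sketch using the DataStax cassandra-driver. The keyspace and table names here (raigad / instances) are made up for illustration and are not Raigad’s actual schema.

from cassandra.cluster import Cluster  # DataStax Python driver

# Hypothetical schema: CREATE TABLE raigad.instances (cluster text, ip text, PRIMARY KEY (cluster, ip))
session = Cluster(["cassandra-seed.example.com"]).connect("raigad")

def register(cluster_name, my_ip):
    # Each node writes its own address so peers can find it.
    session.execute("INSERT INTO instances (cluster, ip) VALUES (%s, %s)", (cluster_name, my_ip))

def discover(cluster_name):
    # Read back the peer list, e.g. to seed discovery.zen.ping.unicast.hosts at bootstrap.
    rows = session.execute("SELECT ip FROM instances WHERE cluster = %s", (cluster_name,))
    return [row.ip for row in rows]

register("es_prod", "10.0.0.12")
print(discover("es_prod"))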

Auto configuration of the elasticsearch.yml file

Raigad provides a range of configuration parameters to tune elasticsearch.yml at bootstrap time, e.g. ASG-based dedicated master/data/search node deployments (the default at Netflix), multi-region deployments, tribe node setups, etc.

Index management

Raigad takes care of cleaning up old indices and creating new ones based on the retention period configured for individual indices via configuration parameters. We currently support daily, monthly and yearly retention periods.
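A minimal sketch of the idea in Python (not Raigad’s actual code, which is Java): derive an index’s age from a date suffix in its name and delete it once it falls outside the configured retention.

import datetime
import requests

RETENTION_DAYS = 30  # example retention for a daily index such as "logs-20150401"

def cleanup(es_url="http://localhost:9200", prefix="logs-"):
    cutoff = datetime.date.today() - datetime.timedelta(days=RETENTION_DAYS)
    # The _cat API returns one index name per line.
    for name in requests.get(es_url + "/_cat/indices?h=index", timeout=10).text.split():
        if not name.startswith(prefix):
            continue
        index_date = datetime.datetime.strptime(name[len(prefix):], "%Y%m%d").date()
        if index_date < cutoff:
            requests.delete("%s/%s" % (es_url, name), timeout=30)  # drop the expired index

cleanup()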

Improvements to run better in AWS

Raigad is used extensively at Netflix in the AWS environment. As mentioned above, for dedicated node deployments we use an ASG naming convention. With regard to credentials, Raigad supports Amazon’s IAM key profile management. Using IAM credentials allows you to provide access to the AWS API without storing an AccessKeyId or SecretAccessKey anywhere on the machine. If required, you can use your own implementation as well.
Raigad also supports scheduled nightly snapshot backups to S3, along with restores at startup or via a REST call (it uses elasticsearch-aws-plugin underneath).

More Info

You can get more info about the features described above or about how to use and install Raigad here.

Summary

Distributed systems are complex to operate and to recover from failure. If you add to that the huge scale at which Netflix operates, you quickly need to decide how to operate such systems. You can either scale a team out to handle the load, or build good automation that can monitor, analyze and alleviate issues automatically. Netflix’s approach has always been the latter. Raigad helps continue this trend by providing a tool to help manage our growing Elasticsearch deployment.
CDE is very excited to add Raigad to our ever growing NetflixOSS library. If you run Elasticsearch on AWS, at scale, we believe Raigad may be useful to you too. As with all of our projects, feedback, code or documentation submissions are always welcome.
If you are passionate about Elasticsearch or Open Source Software, in general, we are always looking for great engineers.



Billing & Payments Engineering Meetup II

On March 18th, we hosted our second Billing & Payments Engineering Meetup at Netflix. It was truly encouraging to see the Bay Area engineering community’s growing interest in the event. Just like the first event, the theater was almost full.

If you missed our first Meetup, you can check it out here.
For this Meetup, we decided to take a different approach. Several teams within Netflix are involved with billing or payments at various levels. Each of us gave a presentation of our work, hoping to provide the audience with a 360-degree, cross-team view of how payments are managed at Netflix. There’s a great synergy between these teams, and we hope it was reflected in the talks we gave.

Stay tuned on the meetup page to be notified of the next event!

Payment Processing in the Cloud - Mathieu Chauvin - Payments Engineering

Now that Netflix has gained tremendous experience with AWS, the Payments Engineering team has re-engineered its suite of applications for the cloud. It’s the first time payments have been processed from a public cloud solution at this scale.
This presentation gives more information about the technical design of this new solution, as well as the transition strategy that was adopted, for a seamless migration of more than 57 million subscribers.

Mat's team is hiring!
Senior Software Engineer in Test - Payments Platform

Architecture about Billing Workflows in the Cloud - Sangeeta Handa & John Brandy - Billing Engineering

Billing is at a crossroads: we are still halfway in our old data center and halfway migrated to the cloud. Billing Engineering has two major aspects. One is the batch renewal of Netflix subscribers, and the other is the set of APIs that change a Netflix customer’s billing state in some way. Our topic for discussion was how Billing Engineering manages its workflow for these APIs across different processes and teams in this scenario, and the technology stack we use to accomplish this.

Sangeeta’s team is hiring!

Payment Analytics at Netflix - Shankar Vedaraman - Data Science Engineering, Payments

Netflix Product has been data driven since inception, and payment processing at Netflix is no different. With more than 55M customers paying Netflix on a monthly basis, there is plenty of data to analyze and use to recommend dynamic routing of transactions that maximizes approval rates. At the meetup, Shankar Vedaraman, who leads the Payment Analytics Data Science and Engineering team, presented the different payments business processes that his team focuses on and touched upon key analytical insights that his team provides.

Shankar's team is hiring!

Security for Billing & Payments - Poorna Udupi - Product and Application Security

Poorna Udupi who leads the Product and Application Security team at Netflix, spoke about making security consumable in the form of tools, libraries and self-service applications to enable developers attain a rapid velocity of feature delivery while simultaneously being secure. Specifically speaking to the audience of billing and payments enthusiasts, he discussed a few security techniques in detail: infrastructure segmentation, tokenization, utilization of big data for fraud and abuse detection, prevention and sanitization. He provided a lightning overview of some of the open source security projects contributed by his team such as Scumblr, Sketchy and others in the pipeline that focus on automating away security functions so that his team can focus on security feature experimentation and innovation.

Poorna's team is hiring!

Escape from PCI Land - Rahul Dani - Growth Product Engineering

Rahul Dani, who leads the Growth Product Engineering team at Netflix, talked about the adventure of steering the middle-tier signup apps out of PCI scope and into a PCI-free environment.

Rahul's team is hiring!

Extracting contextual information from video assets



Here, I will describe our approach to extracting contextual metadata from video assets to enable an improved Netflix user experience across the large catalog we serve.


Part 1: Detecting End-Sequences



When you finish watching a movie, we are able to provide a unique post-play experience, as illustrated below in two examples. The user is presented with the next episode in a series, or with content similar to the most recently watched video. The primary challenge is isolating the salient parts of series and movies without the mind-boggling effort of manually tagging the end points in a large and ever-changing catalog. In other words, we must devise a strategy for detecting when a video ends and the end-sequence begins. Interestingly, the end-sequence is distinctive in a few striking ways. First, it appears at the end of the movie. Second, it almost always consists of text. Finally, there is very little variation between contiguous frames. Using all three of these properties, we created an algorithm that successfully extracts the beginning of the end-sequence.



Two examples of Netflix post-play experiences

Below you'll find an example of text-detected regions (highlighted with yellow rectangles) on the end-sequence of Orange is the New Black:


Automated text detection of end sequence
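The three properties above translate naturally into a frame-level heuristic. Below is a heavily simplified Python/OpenCV sketch, not the production algorithm: it walks the frames of a video, treats consecutive frames that are nearly identical and look text-heavy as candidate credit frames, and returns the start of the final such run. The probably_has_text() helper is a crude stand-in for a real text detector.

import cv2

def probably_has_text(gray_frame):
    # Very crude stand-in for a real text detector: credit frames tend to be edge-dense.
    edges = cv2.Canny(gray_frame, 100, 200)
    return edges.mean() > 10

def find_end_sequence_start(path, diff_threshold=2.0):
    cap = cv2.VideoCapture(path)
    prev, candidate_start, frame_idx = None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            static = cv2.absdiff(gray, prev).mean() < diff_threshold  # little change between frames
            if static and probably_has_text(gray):
                if candidate_start is None:
                    candidate_start = frame_idx   # start of a static, text-heavy run
            else:
                candidate_start = None            # run broken; credits must extend to the end
        prev = gray
        frame_idx += 1
    cap.release()
    return candidate_start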



Part 2: Detecting Similar Frames Across Multiple Video Assets

At Netflix, for a given video, we have several assets encoded for different countries and locales. There are many applications for detecting similar frames across multiple video assets.


We extract visual fingerprints from a collection of selected frames. We can then use these fingerprints as comparative models: if similar frames appear in the rest of the videos, we can mark them as the end of the start-sequence.


Let’s take an example: say Fig. 1A is the last frame of the title sequence of our favorite TV series. We'll call it our "reference frame," which we want to match against the rest of the episodes. In this case, we extracted an image histogram from the reference frame to serve as its fingerprint. Now we compare this fingerprint with a frame from another episode (Fig. 1B) of the same series. Given that both fingerprints are similar, we can walk through the rest of the episodes and mark the matching frames as identical/similar. Besides detecting the start sequence, this approach can be used to find other interesting points within a video.
Histogram based fingerprints of video frames
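A minimal Python/OpenCV sketch of the histogram-fingerprint idea (illustrative only; the histogram channels, threshold and comparison metric here are assumptions): compute a color histogram for the reference frame and compare it against frames from another episode.

import cv2

def fingerprint(frame, bins=32):
    # Color histogram over the hue and saturation channels, normalized so frames compare fairly.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def find_similar_frame(reference_frame, other_episode_path, threshold=0.9):
    ref = fingerprint(reference_frame)
    cap = cv2.VideoCapture(other_episode_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        score = cv2.compareHist(ref, fingerprint(frame), cv2.HISTCMP_CORREL)
        if score > threshold:   # fingerprints match: likely the same title-sequence frame
            cap.release()
            return frame_idx
        frame_idx += 1
    cap.release()
    return None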


Summary



Here we have outlined two classes of algorithms that allow us to efficiently extract metadata from video assets, enabling a unique, uninterrupted viewing experience at Netflix.


If you have great or innovative ideas, come join us on the Content Platform Engineering team!

Introducing Vector: Netflix's On-Host Performance Monitoring Tool

Vector is an open source host-level performance monitoring framework which exposes hand-picked, high-resolution system and application metrics to every engineer’s browser. Having the right metrics available on demand and at a high resolution is key to understanding how a system behaves and correctly troubleshooting performance issues. Previously, we'd log in to instances as needed, run a variety of commands, and sift through the output for the metrics that matter. Vector cuts down the time to get to those metrics, helping us respond to incidents more quickly.

Vector provides a simple way for users to visualize and analyze system and application-level metrics in near real-time. It leverages the battle-tested open source system monitoring framework Performance Co-Pilot (PCP), layering a flexible and user-friendly UI on top. The UI polls metrics at up to 1-second resolution, rendering the data in completely configurable dashboards that simplify cross-metric correlation and analysis.

PCP’s stateless model makes it lightweight and robust. Its overhead on hosts is negligible, as clients are responsible for keeping track of state, sampling rate, and computation. Additionally, metrics are not aggregated across hosts or persisted outside of the user’s browser session, keeping the framework light. Vector requires only your local browser and PCP installed on the host you wish to monitor. No intermediate collector, server, or database infrastructure is required.

We are excited to release Vector to the community and look forward to feedback and collaboration!

High-Level Architecture

Vector itself is a web application that runs completely inside the user's browser. It was built with AngularJS and leverages D3.js for charts. In the future, the Vector package will also include custom metric agents.

Vector has a “default” dashboard exposed at launch.  This dashboard is a simple page that holds a few options including UI object visibility flags, widget definitions, and a set of loaded widgets. Once loaded, it will display the set of loaded widgets and present the user with controls to include any of the additional predefined widgets.

Widgets are loaded into dashboards. A widget object will contain details about a specific widget, like its name, template, style, and more importantly, the data model to be used. Data models are, in a nutshell, objects that control the metrics required for each widget and how the values are used in it. Data model prototypes are relatively simple. They extend a base WidgetDataModel prototype and define their own init and destroy functions. Most of what is done in those functions is adding and removing metrics from the metric poller list, creating callback functions that deal with the data points returned from the poller itself, and referencing the right data structure to be used in the charts.

Generic data models were also created so they could be reused on new widgets without having to create a specific data model for it. More details about the data models can be found on Vector's wiki page.

Metrics are polled from Performance Co-Pilot's web daemon. They are referenced by unique names, and current values are returned with a timestamp so that they can be normalized. Vector makes use of two data structures to store metrics and their values. The "raw" metric data structure holds the original metric values that came from PCP. The "derived" metric data structure holds metrics that were modified by a data model function, like a cumulative function or a normalization function.

The metric poller is the component that goes over the list of "raw" metrics and polls them from PCP via HTTP, given the selected polling interval. It also executes all data model functions and consequently updates the "derived" metric data structure. Charts are automatically updated every time the data structure is updated.
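A rough Python sketch of this polling loop (an assumption of how pmwebd's pmwebapi is typically used, not Vector's actual JavaScript poller; endpoint parameters may vary by PCP version): create a context for the target host, then repeatedly fetch the named metrics.

import time
import requests

HOST = "http://monitored-host.example.com:44323"   # pmwebd's default port
METRICS = "kernel.all.load,mem.util.used"          # the "raw" metrics to poll

# Create a PCP context for the pmcd on the target host (assumed pmwebapi endpoint).
ctx = requests.get(HOST + "/pmapi/context?hostname=localhost&polltimeout=30").json()["context"]

while True:
    data = requests.get("%s/pmapi/%d/_fetch?names=%s" % (HOST, ctx, METRICS)).json()
    for metric in data["values"]:
        for inst in metric["instances"]:
            # A real data model would normalize counters against the returned timestamp here.
            print(metric["name"], inst.get("instance"), inst["value"])
    time.sleep(2)  # polling interval selected in the UI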

Performance Co-Pilot (PCP) is a system performance and analysis framework. It provides metric agents, a metric collector and a web daemon that is leveraged by the metric poller to collect metric values. More details about PCP can be found at pcp.io.

Getting Started


In order to get started, you should first install Performance Co-Pilot (PCP) on each host you plan to monitor. PCP will collect the metrics and make them available for Vector. The pmcd and pmwebd services need to be running on each host, the latter of which needs to expose its tcp/44323 network port.

Optional monitoring agents can also be installed in order to collect specific metrics that are not supported by PCP's system agent.

Once PCP is installed, you should be able to run Vector and connect to the target host.

Performance Co-Pilot (PCP)

Vector depends on Performance Co-Pilot (PCP) to collect metrics on each host you plan to monitor.

Since Vector depends on version 3.10 or higher, the packages currently available on most Linux distro repositories would not suffice. Until newer versions are available in the repositories, you should be able to install PCP from binary packages made available by the PCP development team on:

ftp.pcp.io

Or build it from source. To do so, get the current version of the source code:

$ git clone git://git.pcp.io/pcp

Then build and install:

$ cd pcp
$ ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
$ make
$ sudo make install

More information on how to install Performance Co-Pilot can be found at pcp.io.


Vector

Vector is a static web application that runs inside the client's browser. It can run locally or deployed to any HTTP server available, like Apache or Nginx.

To run it locally, first clone the repo:

$ git clone https://github.com/Netflix/vector.git

Make sure you have Bower installed on your system. Bower is a package management system for client-side programming, optimized for front-end development.


And install all dependencies:

$ cd vector
$ bower install

You can run Vector with Gulp. Gulp is an automated task runner and includes a development web server with live reload. In order to start Gulp’s web server, first make sure you have Gulp installed on your system:


Then, install all dependencies and execute the default Gulp task:

$ npm install
$ gulp

You can also run Vector with Python's SimpleHTTPServer:

$ cd vector
$ python -m SimpleHTTPServer 8080

Then open Vector on your browser:

http://localhost:8080

And enter the hostname from the server you plan on monitoring. That's it!

Widgets & Dashboards
Vector's UI is based on dashboards and widgets. You can have one dashboard per browser tab/window. Dashboards are completely configurable and can have multiple widgets. Currently there are no limits on the number of widgets a dashboard can contain, but real-time rendering of multiple charts can consume a significant amount of CPU and slow down the application. Currently, changes made to dashboards are not persisted.

Window & Interval
Vector's UI aims to be extremely simple. Besides the hostname, there are only two configuration options, window and interval. The window option allows the user to select the rolling window size, represented in minutes, for all widgets in a dashboard. The interval option allows the user to select the metric polling interval, represented in seconds. If you have many widgets in a dashboard and the application starts to show signs of slowness, you should be able to decrease the window size and/or increase the interval to reduce CPU utilization.

Dashboards & Widgets

Vector comes with a predefined set of widgets and dashboards that can be easily extended. Here is a short list of metrics available by default.

CPU

  • Load Average
  • Runnable
  • CPU Utilization
  • Per-CPU Utilization
  • Context Switches

Memory

  • Memory Utilization
  • Page Faults

Disk

  • Disk IOPS
  • Disk Throughput
  • Disk Utilization
  • Disk Latency

Network

  • Network Drops
  • TCP Retransmits
  • TCP Connections
  • Network Throughput
  • Network Packets
Currently, there are only two pre-configured dashboards in Vector: the "default" dashboard, with a set of commonly used widgets, and an empty dashboard. To change dashboards, click on the "widget" drop-down menu and select the desired dashboard.

Next Steps

  • More widgets and dashboards
  • User-defined dashboards
  • Metric snapshots
  • CPU Flame Graphs
  • Disk Latency Heat Maps
  • Integration with Servo
  • Support for Cassandra

Conclusion

Observability is key to understanding how an application behaves under certain conditions and is paramount to successfully troubleshooting any performance issue. Vector allows us to closely monitor hosts in near real-time and easily correlate metrics, making them accessible to every engineer and simplifying the process of troubleshooting issues. It has proven to be an invaluable tool in helping us achieve great performance, and we plan to continue building and improving it!

You can find Vector on GitHub and on netflix.github.io!


Learning a Personalized Homepage



As we've described in our previous blog posts, at Netflix we use personalization extensively and treat every situation as an opportunity to present the right content to each of our over 57 million members. The main way a member interacts with our recommendations is via the homepage, which they see when they log into Netflix on any supported device. The primary function of the homepage is to help each member easily find something to watch that they will enjoy. A problem we face is that our catalog contains many more videos than can be displayed on a single page and each member comes with their own unique set of interests. Thus, a general algorithmic challenge becomes how to best tailor each member's homepage to make it relevant, cover their interests and intents, and still allow for exploration of our catalog.


This type of problem is not unique to Netflix; it is faced by others such as news sites, search engines, and online stores. Any site that needs to choose items from a large number of available possibilities and then present them in a coherent and easy-to-navigate manner faces the same general challenges. Of course, the problem of optimizing Netflix homepages has its own unique aspects, ranging from interface constraints to differences in how movies and TV are consumed compared to other media.


An example of a personalized Netflix homepage on our website.


Currently, the Netflix homepage on most devices is structured with videos (movies and TV shows) organized into thematically coherent rows presented in a two-dimensional layout. Members can scroll either horizontally on a row to see more videos in that row or vertically to see other rows. Thus, a key part of our personalization approach is how we choose rows to display on the homepage. This involves figuring out how to select the rows most relevant to each member, how to populate those rows with videos, and how to arrange them on the limited page area such that selecting a video to watch is intuitive. In the rest of this post, we will highlight what we think are the most relevant and interesting aspects of this problem and how we can go about solving some of them.


Before going on, it is worth mentioning that at Netflix we have a multitude of algorithms for doing personalization and recommendation including: how to predict the rating that a member will give a video, how to rank videos in each row, and how to create meaningful groupings of videos. Thus, in some sense, personalized page generation represents the next logical step in the evolution of our recommendation system that started with rating prediction and subsequently evolved into personalized ranking of our entire catalog. It involves solving a more general problem: how best to populate a personalized two-dimensional page of content, including recommendations.

Evolution of our personalization approach.


Why Rows Anyway?


We organize our homepage into a series of rows to make it easy for members to navigate through a large portion of our catalog. By presenting coherent groups of videos in a row, providing a meaningful name for each row, and presenting rows in a useful order, members can quickly decide whether a whole set of videos in a row is likely to contain something that they are interested in watching. This allows members to either dive deeper and look for more videos in the theme or to skip them and look at another row. This would not be the case if, for example, the page contained a large, unorganized collection of relevant videos.


A possible row of titles that might be watched by one of our Netflix original characters.


One natural way to group videos is by genre or sub-genre or other video metadata dimensions like release date. Of course, the relationship between videos in a row does not have to be due to metadata alone, but can also be formed from behavioral information (for example from collaborative filtering algorithms), videos we think a member is likely to watch, or even groups of videos watched by a friend. Thus, each row can offer a unique and personalized slice of the catalog for a member to navigate. Part of the challenge and fun of creating a personalized homepage is figuring out new ways to create useful groupings of videos, which we are constantly experimenting with (e.g., rows of titles that might be watched by one of our Netflix original characters shown above).


Process for creating and choosing rows.


Once we have a set of possible video groups to consider for a page, we can begin to assemble the homepage from them. To do this, we start by finding candidate groupings that are likely relevant for a member based on the information we know about them. This also involves coming up with the evidence (or explanations) to support the presentation of a row, for example the movies that the member has previously watched in a genre. Next, we filter each group to handle concerns like maturity rating or to remove some previously watched videos. After filtering, we rank the videos in each group according to a row-appropriate ranking algorithm, which produces an ordering of videos such that the most relevant videos for the member in a group are at the front of the row. From this set of row candidates we can then apply a row selection algorithm to assemble the full page. As the page is assembled, we do additional filtering like deduplication to remove repeat videos and format rows to the appropriate size for the device.  
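The pipeline described above can be summarized in a short Python sketch; the function names and signatures here are purely illustrative and stand in for the real candidate-generation, filtering, ranking, and selection algorithms.

def build_homepage(member, device, candidate_row_generators,
                   filter_row, rank_videos, select_rows, dedupe):
    # 1. Candidate rows with supporting evidence (e.g. "Because you watched ..." explanations).
    candidates = []
    for generate in candidate_row_generators:
        candidates.extend(generate(member))

    # 2. Filter each group (maturity rating, some previously watched videos, ...).
    candidates = [filter_row(row, member) for row in candidates]

    # 3. Rank the videos inside each row with a row-appropriate ranking algorithm.
    candidates = [rank_videos(row, member) for row in candidates if row]

    # 4. Select and order rows to assemble the full page for this member and device.
    page = select_rows(candidates, member, device)

    # 5. Page-level cleanup: dedupe repeated videos and format rows to the device's size limits.
    return dedupe(page, device)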


Page-level algorithmic challenge





To algorithmically create a good personalized homepage means assembling one page per member profile and device from thousands of videos that may be relevant for a member and from easily tens of thousands of potential rows, each with a variable number of videos. On top of that, we need to balance several factors that often compete for precious screen real estate. Our approach to personalization and recommendation largely focuses on helping our members find something new to watch, which we call discovery. However, we also want to make it easy for a member to watch the next episode of a show or re-watch something that they watched in the past, which normally falls outside the realm of recommendation. We want our recommendations to be accurate in that they are relevant to the tastes of our members, but they also need to be diverse so that we can address the spectrum of a member’s interests versus only focusing on one. We want to be able to highlight the depth in the catalog we have in those interests and also the breadth we have across other areas to help our members explore and even find new interests. We want our recommendations to be fresh and responsive to the actions a member takes, such as watching a show, adding to their list, or rating; but we also want some stability so that people are familiar with their homepage and can easily find videos they’ve been recommended in the recent past. Finally, we need to be able to place task-oriented rows, such as “My List,” in amongst the more discovery-oriented rows.


Each device has different hardware capabilities that can limit the number of videos or rows displayed at any one time and how big the whole page can be. As such, the page generation process must be aware of the constraints of the device for which it is creating the page, including the number of rows, the minimum and maximum length of a row, the size of the visible portion of the page, and whether or not certain rows are required or are not applicable for a certain device.


While there are many challenges to page generation, tackling recommendation problems at this level also opens up new solutions. As mentioned before, selecting a diverse set of items is important in a recommendation system. However, it can be challenging to navigate a diverse ranking since the relevant items may be blended with other items that do not match someone’s current intent. However, by presenting a two-dimensional navigation layout, a member can scroll vertically to easily skip over entire groups of content that may not match their current intent and then find a more relevant set, which they can then scroll horizontally to see more recommendations in that set. This allows for coherent, meaningful individual rows to be selected while maintaining the diversity of the videos shown on the whole page, and thus lets the member have both relevance and diversity.


Building a page algorithmically



There are several approaches for how we can build our homepage algorithmically. The most basic is a rule-based approach, which we used for a long time. Here a set of rules define a template that dictates for all members what types of rows can go in certain positions on the page. For example, the rules could specify that the first row would be Continue Watching (if any), then Top Picks (if any), then Popular on Netflix, then 5 personalized genre rows, and so on. The only personalization in this approach was from selecting candidate rows in a personalized way, such as including “Because you watched <video>” rows for videos someone has watched in the past and genre rows based on known genre preferences. To choose specific rows within each type, simple heuristics and sampling were used. We evolved this template using A/B testing to understand where to place rows for all members.


This approach served us well, but it ignored many aspects we consider important for the quality of the page, such as the quality of the videos in the row, the amount of diversity on the page, the affinity of members for specific kinds of rows, and the quality of the evidence we can surface for each video. It also made it hard to add new types of rows, because for a new row to succeed it would need to not only contain a relevant set of videos in a good order but also be placed appropriately in the template. Because of this, the rules for the template grew over time and became too complex to handle the variety of rows and how they should all be placed, which represented a local optimum for the member experience.


To address these issues, we can instead think of personalizing the ordering of rows on the homepage. The simplest approach for doing this is to treat rows as items in a ranking problem, which we call a row-ranking approach. For this approach, we could leverage a lot of existing recommendation or learning-to-rank approaches by developing a scoring function for rows, applying it to all the candidate rows independently, sorting by that function, and then picking the top ones to fill the page. Even though the space of rows may be relatively big, this type of approach could be relatively fast and may result in reasonable accuracy. However, doing this would lack any notion of diversity, so someone could easily get a page full of slight variations of their interests, such as many rows each with different variants of comedies: late-night, family, romantic, action, etc.


A simple way to add in diversity is to switch from a row-ranking approach to a stage-wise approach using a scoring function that considers both a row as well as its relationship to both the previous rows and the previous videos already chosen for the page. In this case, one can take a simple greedy approach and pick the row that maximizes this function as the next row to use and then re-score all the rows for the next position taking that selection into account. Depending on the diversity function, this greedy selection may not lead to an optimal page.  Using a stage-wise approach with k-row lookahead could result in a more optimal page than greedy selection, but it comes with increased computational cost. Other approaches to greedily add diversity based on submodular function maximization can also be used.
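As a concrete illustration of the greedy stage-wise idea (a sketch, not our production algorithm), assume a relevance(row, member) score and a similarity(row_a, row_b) measure; each position is filled with the row that maximizes relevance minus a penalty for similarity to the rows already placed.

def greedy_page(candidate_rows, member, num_positions,
                relevance, similarity, diversity_weight=0.5):
    page = []
    remaining = list(candidate_rows)
    for _ in range(num_positions):
        def stagewise_score(row):
            # Relevance of the row, discounted by its similarity to rows already on the page.
            penalty = max((similarity(row, placed) for placed in page), default=0.0)
            return relevance(row, member) - diversity_weight * penalty

        best = max(remaining, key=stagewise_score)
        page.append(best)
        remaining.remove(best)  # re-score the rest for the next position
    return page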


However, even the stage-wise algorithm is not guaranteed to produce an optimal page because a fixed horizon may limit the ability to fill in better rows further down the page. Thus, if we can instead take a page-wise approach by defining a full-page scoring function, we can try to optimize it by choosing rows and videos appropriately to fill the page. Of course, the space of possible pages is huge, even larger than the space of possible rows. Since a page layout is defined in a discrete space, directly optimizing a function that defines the quality of the whole page is a computationally prohibitive integer programming problem.


When solving a page optimization problem with any of these approaches, there are also various constraints that need to be taken into account, as mentioned before, like deduping, filtering, and device-specific constraints. Each of these constraints adds to the complexity of the optimization problem.

Notional importance of navigation modeling. Members are more likely to scroll vertically than horizontally, which means videos presented in the upper left are much more likely to be seen than those in the lower right.


When forming the homepage it is also important to consider how members navigate the page, i.e., to consider which positions on the page they are likely to pay attention to and interact with in a session. Placing the most relevant videos in the positions that are most likely to be seen, which tends to be the upper-left corner, should reduce the time for a member to find something relevant to watch. However, modeling navigation on a two-dimensional page is difficult, especially taking into account that different people may navigate differently, people’s navigation patterns may change over time, there are differences in navigation across different device types based on the interaction design, and that navigation is clearly dependent on the relevance of the content shown. With an accurate navigation model, we can inform better placement of videos and rows and where on the page to focus on relevance as opposed to diversity.  


Machine Learning for page generation



At the core of building a personalized page is a scoring function that can evaluate the quality of a row or a page. While we could use heuristics or intuition for building such a scoring function and tune it using A/B testing, we prefer to learn a good function from the data so that we can easily incorporate new data sources and balance the various different aspects of a homepage. To do this, we can use a machine learning approach to create the scoring function by training it using historical information of which homepages we have created for our members, what they actually see, how they interact, and what they play.


There is a large set of features that we could potentially use to represent a row for our learning algorithms. Since rows contain a set of videos, we can use any features of those videos in the row representation, either by aggregating across the row or indexing them by position. These features can be simple metadata or more useful model-based features that represent how good of a recommendation we believe a specific video is for a member. Of course, we have many different recommendation approaches, so we can include them as different features to learn an ensemble of them at the page level. We can also look at the quality of the evidence associated with the row, such as how much support there is for a member being interested in a specific genre. We can also look at past interactions with the row to see if that row or similar such rows have been consumed in the past by the member. We can also add simple descriptive features like how many videos are in a row, in what position a row is being placed on a page, or how often we’ve shown the row in the past. Diversity can also be additionally incorporated into the scoring model when considering the features of a row compared to the rest of the page by looking at how similar the row is to the rest of the rows or the videos in the row to the videos on the rest of the page.


While the space of potentially useful features is quite large, there are several challenges with training machine learning models for scoring rows. One challenge is dealing with presentation bias: a member can only play from a row on the homepage that we’ve chosen to display, which can have a huge impact on the training data. To further complicate things, the position of a row on the page can greatly affect whether a member actually sees the row and then chooses to play from it. To handle these presentation and position biases, we need to be extremely careful about how we select training data for our algorithms. There is also a challenge around how attribution is handled in the model: a video may have been played from a certain row in the past, but does that mean the member would have chosen that same video if it had been placed in a different row but in the first position? Perhaps the row title “Critically Acclaimed Documentaries” was responsible for a play that would not have happened without that additional evidence, for example in a “New Releases” row, even if that row was in a better position. Learning over features that represent diversity can also be challenging because, while the space of potential rows at different positions on the page is large, when the rest of the page (or the already chosen rows) is taken into account for diversity, the space of possible pages is even larger.



Page-level metrics



To deal with these challenges, as with any algorithmic approach, choosing a good metric is important. Of fundamental importance in page generation is how to evaluate the quality of the pages produced by a specific algorithm during offline experimentation. While we ultimately will test any potential algorithmic improvement online in an A/B test, we would like to be able to focus our precious A/B testing resources on algorithms that we have evidence are likely to improve the quality of the pages. We also need to be able to tune the parameters of those algorithms before A/B testing. To do this, we can use historical data to generate hypothetical pages from new algorithmic approaches, provided we can choose a good metric for page quality.

Example of two-dimensional recall metrics.  For each page variant, the fractions on the side represent the recall at 1-by-3, 2-by-3, and 3-by-3 metrics, respectively.


To come up with page-level quality metrics, we took inspiration from ranking metrics that are common in information retrieval (many of which exist in the literature) for a one-dimensional list and created ones that work over a two-dimensional layout. For instance, consider a simple metric like Recall@n, which measures the number of relevant items in the top n divided by the total number of relevant items. We can extend it in two dimensions to be Recall@m-by-n, where now we count the number of relevant items in first m rows and n columns on the page divided by the total number of relevant items. Thus, Recall@3-by-4 may represent quality of videos displayed in the viewport on a device that initially can show 3 rows and 4 videos at a time. One nice property of recall defined this way is that it automatically can handle corner-cases like duplicate videos or short rows. We can also hold one of the values n (or m) fixed and sweep across the other to calculate, for instance, how the recall increases in the viewport as the member would scroll down the page.
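For concreteness, here is a small Python sketch of Recall@m-by-n under simple assumptions: the page is a list of rows, each row a list of video ids, and relevant is the set of videos the member actually found relevant (e.g. played).

def recall_at_m_by_n(page, relevant, m, n):
    """Fraction of relevant videos appearing in the first m rows and first n columns of the page."""
    if not relevant:
        return 0.0
    seen = set()
    for row in page[:m]:
        seen.update(row[:n])        # a set naturally handles duplicate videos and short rows
    return len(seen & set(relevant)) / float(len(relevant))

# Sweeping the row position at a fixed column width shows how recall grows as a member scrolls down:
# [recall_at_m_by_n(page, relevant, m, 3) for m in range(1, len(page) + 1)]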


Comparison of four page algorithms in recall up to a fixed column position while sweeping the row position. The red line is the previous rule-based approach and the blue is a personalized layout.


Of course, Recall is a basic metric and requires choosing values for m and n, but we can likewise extend metrics that assign a score or likelihood for a member seeing a position, like NDCG or MRR, to the two-dimensional case. We can also adapt navigation models like Expected Reciprocal Rank to incorporate two-dimensional navigation through the page and take into account the cascading aspect of browsing. With such page-level metrics defined, we can use them to evaluate changes in any of the algorithmic approaches used to generate the page, not just the algorithms for ordering the rows, but also the selection, filtering, and ranking algorithms, or any of the input data that they use.


Other challenges



There is no shortage of challenging questions that come up in engineering the homepage. For example: When is it appropriate to take into account other context variables, such as the time of day or the device, when populating the homepage? How do we find the appropriate trade-off between finding the optimal page and computational cost? How do we form the homepage during the critical first few sessions of a member, precisely when we have the least information about them? We need to think about and weigh the importance of each of these questions every day in order to continually improve the Netflix homepages.

Conclusion



While Netflix may be most famous in the recommendations community for the Netflix prize, we think of personalized page generation as the next step in the evolution of our personalization approach from rating prediction to video ranking to now page generation. We have taken the initial step of coming up with our first algorithm for personalized page generation that showed significantly better online performance than our existing template, and deployed it last year. However, personalized page generation is a challenging problem that involves balancing a multitude of factors, and we think that this is just the beginning. There is a lot of potential to improve the homepages for all of our members and help them easily find content they will love.

We are always looking for talented researchers and engineers to join our team. So if you are interested in helping us solve these types of problems and increasing global happiness, please take a look at some of our open positions on the Netflix jobs page.

Introducing FIDO: Automated Security Incident Response



We're excited to announce the open source release of FIDO (Fully Integrated Defense Operation - apologies to the FIDO Alliance for acronym collision), our system for automatically analyzing security events and responding to security incidents.

Overview

The typical process for investigating security-related alerts is labor intensive and largely manual. To make the situation more difficult, as attacks increase in number and diversity, there is an increasing array of detection systems deployed and generating even more alerts for security teams to investigate.

Netflix, like all organizations, has a finite amount of resources to combat this phenomenon, so we built FIDO to help. FIDO is an orchestration layer that automates the incident response process by evaluating, assessing and responding to malware and other detected threats.

The idea for FIDO came from a simple proof of concept a number of years ago. Our process for handling alerts from one of our network-based malware systems was to have a help desk ticket created and assigned to a desktop engineer for follow-up - typically a scan of the impacted system or perhaps a re-image of the hard drive. The time from alert generation to resolution of these tickets spanned from days to over a week. Our help desk system had an API, so we had a hypothesis that we could cut down resolution time by automating the alert-to-ticket process. The simple system we built to ingest the alerts and open the tickets cut the resolution time to a few hours, and we knew we were onto something - thus FIDO was born.

Architecture and Operation

This section describes FIDO's operation, and the following diagram provides an overview of FIDO’s architecture.




Detection

FIDO’s operation begins with the receipt of an event via one of FIDO’s detectors. Detectors are off-the-shelf security products (e.g. firewalls, IDS, anti-malware systems) or custom systems that detect malicious activities or threats. Detectors generate alerts or messages that FIDO ingests for further processing. FIDO provides a number of ways to ingest events, including via API (the preferred method), SQL database, log file, and email. FIDO currently supports a variety of detectors (e.g. Cyphort, ProtectWise, CarbonBlack/Bit9), with more planned or under development.

Analysis and Enrichment

The next phase of FIDO operation involves deeper analysis of the event and enrichment of the event data with both internal and external data sources. Raw security events often have little associated context, and this phase of operation is designed to supplement the raw event data with supporting information to enable more accurate and informed decision making.

The first component of this phase is analysis of the event’s target - typically a computer and/or user (but potentially any targeted resource). Is the machine a Windows host or a Linux server? Is it in the PCI zone? Does the system have security software installed and the latest patches? Is the targeted user a Domain Administrator? An executive? Having answers to these questions allows us to better evaluate the threat and determine what actions need to be taken (and with what urgency). To gather this data, FIDO queries various internal data sources - currently supported are Active Directory, LANDesk, and JAMF, with other sources under consideration.

In addition to querying internal sources, FIDO consults external threat feeds for information relevant to the event under analysis. The use of threat feeds helps FIDO determine whether a generated event may be a false positive, and how serious and pervasive the issue may be. Another way to think of this step is ‘never trust, always verify.’ A generated alert is simply raw data - it must be enriched, evaluated, and corroborated before actioning. FIDO supports several threat feeds, including ThreatGrid and VirusTotal, with additional feeds under consideration.

Correlation and Scoring

Once internal and external data has been gathered about a given event and its target(s), FIDO seeks to correlate the information with other data it has seen and to score the event to facilitate its ultimate disposition. The correlation component serves several functions. First, have multiple detectors identified this same issue? If so, it could potentially be a more serious threat. Second, has one of your detectors already blocked or remediated the issue (for example, a network-based malware detector identifies an issue and a separate host-based system repels the same item)? If the event has already been addressed by one of your controls, FIDO may simply provide a notification that requires no further action. The following image gives a sense of how the various scoring components work together.


Scoring is multi-dimensional and highly customizable in FIDO. Essentially, what scoring allows you to do is tune FIDO’s response to the threat and your own organization’s unique requirements. FIDO implements separate scoring for the threat, the machine, and the user, and rolls the separate scores into a total score. Scoring allows you to treat PCI systems different than lab systems, customer service representatives different than engineers, and new event sources different than event sources with which you have more experience (and perhaps trust). Scoring leads into the last phase of FIDO’s operation - Notification and Enforcement.
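A simplified Python sketch of the idea (the actual weights, scales, and thresholds in FIDO are configurable and not shown here): separate threat, machine, and user scores are combined into a total score that drives the notification/enforcement decision.

def total_score(threat_score, machine_score, user_score,
                weights=(0.5, 0.25, 0.25), threshold=7.0):
    # Each component is assumed to be on a 0-10 scale; weights reflect your organization's priorities
    # (e.g. weigh PCI-zone machines or Domain Administrator accounts more heavily upstream).
    combined = (weights[0] * threat_score +
                weights[1] * machine_score +
                weights[2] * user_score)
    action = "enforce" if combined >= threshold else "notify"
    return combined, action

print(total_score(threat_score=9, machine_score=8, user_score=4))  # e.g. (7.5, 'enforce')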

Notification and Enforcement

In this phase, FIDO determines and executes a next action based on the ingested event, collected data, and calculated scores. This action may simply be an email to the security team with details or storing the information for later retrieval and analysis. Or, FIDO may implement more complex and proactive measures such as disabling an account, ending a VPN session, or disabling a network port. Importantly, the vast majority of enforcement logic in FIDO has been Netflix-specific. For this reason, we’ve removed most of this logic and code from the current OSS version of FIDO. We will re-implement this functionality in the OSS version when we are better able to provide the end-user reasonable and scalable control over enforcement customization and actions.

Open Items & Future Plans

Netflix has been using FIDO for a bit over 4 years, and while it is meeting our requirements well, we have a number of features and improvements planned. On the user interface side, we are planning for an administrative UI with dashboards and assistance for enforcement configuration. Additional external integrations planned include PAN, OpenDNS, and SentinelOne. We're also working on improvements around correlation and host detection. And, because it's now OSS, you are welcome to suggest and submit your own improvements!

-Rob Fry, Brooks Evans, Jason Chan

Netflix Streaming - More Energy Efficient than Breathing

Netflix Streaming: Energy Consumption for 2014 was 0.0013 kWh per Streaming Hour Delivered

  • 36% was from renewable sources

  • 28% was offset with renewable energy credits

  • We plan to be fully offset by 2015, and to increase the contribution of renewable sources
  • Carbon footprint of about 300g of CO2 per customer represents about 0.0007% of the typical US household footprint of 43,000 kg (48 tons) of CO2 per year


Since 2007 when Netflix launched its streaming service, usage has grown exponentially. Last quarter alone, our 60 million members collectively enjoyed 10 billion streaming hours worldwide.
Netflix streaming consumes energy in two main ways:
  1. The majority of our technology is operated in the Amazon Web Services (AWS) cloud platform. AWS offers us unprecedented global scale, hosting tens of thousands of virtual instances and many petabytes of data across several cloud regions.
  2. The audio-video media itself is delivered from “Open Connect” content servers, which are forward positioned close to, or inside of, ISP networks for efficient delivery.
In addition, energy is consumed by:
  1. The ISP networks, which carry the data across “the last mile” from our content servers to our customers.
  2. The “consumer premises equipment” (CPE) that includes cable or DSL modems, routers, WiFi access points, set-top boxes, and TVs, laptops, tablets, and phones.
First and foremost, we have focused on efficiency -- making sure that the technology we have built and use is as efficient as possible, which helps with all four components: those for which Netflix is responsible, and those associated with ISP operations and consumer choices.  Then we have focused on procuring renewables or offsets for the power that our own systems consume.

AWS Footprint

Because Netflix relies more heavily on AWS regions that are powered primarily by renewable energy (including the carbon-neutral Oregon region), our energy mix is approximately 50% from renewable sources today. We mitigate all of the remaining carbon emissions, which added up to approximately 10,200 tons of CO2e in 2014, by investing in renewable energy credits (RECs) in the geographic areas that host our cloud footprint; last year, the majority went to RECs for wind projects in North America, with the remainder going to Guarantees of Origin (GOs) for hydropower in Europe.
Purchasing renewable energy credits (RECs) allows us to be carbon-neutral in the cloud, but our main strategy is to be more efficient and consume less energy in the first place. Back in the data center days, long provisioning cycles and spikes in customer demand required us to maintain large capacity buffers that went unused most of the time: overall server utilization percentage was in the single digits. Thanks to the elasticity of the cloud, we are able to instantaneously grow and shrink our capacity along with customer demand, generally keeping our server utilization above 50%. This brought significant benefits to our bottom line (moving to the cloud reduced our server-side costs per streaming hour by 85%), but also allowed us to drastically improve our carbon efficiency.
Open Connect Footprint
Open Connect, the Netflix Content Delivery Network, was designed with power efficiency in mind. Today, the entirety of Netflix’s Content Delivery servers consumes 1.4 megawatts of power. While these servers are located in hundreds of locations across the globe, the majority of them are in major colocation vendors with interests similar to ours in ensuring a bright future for renewable energy.
As we have evolved Open Connect, we have reduced the energy consumption of our servers significantly. At our 2012 launch, we consumed nearly 0.6 watts per megabit per second (Mbps) of peak capacity. In 2015, our flash-based servers consume less than 0.006 watts per Mbps, a 100X improvement. Those flash-based servers generate nearly 70% of Netflix’s global traffic footprint.
When choosing where to locate Open Connect CDN servers, sustainability is a key metric used to evaluate our potential partners. It’s important that our data center providers commit to 100% green power through RECs and that they continue to find new and innovative ways to become carbon neutral.  One such example is Equinix’s experiment with Bloom Energy fuel cells in its SV5 data center in San Jose, one of the facilities in which Netflix equipment is colocated.  Equinix recently announced a major initiative to adopt 100% clean and renewable energy across their global platform. We have a goal to work with datacenter operators to increase their use of renewable sources of power, and we expect to buy offsets for 100% of any power that is not from renewable sources for 2015 and beyond.
We estimate that our Open Connect servers used non-renewable power responsible for about 7,500 Tons of CO2e in 2014.

ISPs

While we don’t control the energy choices of ISPs, we have engineered our Open Connect media servers to minimize the requirements for routers, by providing routing technology as part of the package, so that an ISP who chooses to interconnect directly with Netflix can usually use a smaller, cheaper, and much more power-efficient switch instead of a router for bringing Netflix traffic onto their networks.  In some cases, avoiding the need for a router might eliminate three quarters of the power footprint of a particular deployment.

Consumer Premise Equipment

The energy footprint of consumers’ home equipment (shared between various entertainment and computing uses in the home) dwarfs all the upstream elements by perhaps two orders of magnitude. Our focus here has been to provide streaming technology for Smart TVs, set-top boxes, game consoles, tablets, phones, and computers that is as efficient as possible. For example, a big focus for the 2015 Smart TV platforms has been suspend and resume capabilities, which ensure that Netflix can be started quickly from a powered-down state and help TV manufacturers build Energy Star compliant TVs that don’t waste energy while the user is not watching. This is one of several components in our “Netflix Recommended TV” program. Similarly, our choice of encoder technology takes into account the hardware acceleration capabilities of devices such as smart phones, tablets, and laptop graphics chips. Hardware acceleration reduces the power consumption of video rendering, which can extend tablet battery life by 4x, with a matching reduction in the total power consumed by streaming activity.
A typical household watching Netflix might include 5W for the cable modem, 10W for the WiFi access point, and 100W for the Smart-TV.  115Wh of home power is responsible for about 70g CO2e for one hour of viewing.
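The arithmetic behind that estimate, as a quick Python sketch; the grid carbon-intensity figure (~600 g CO2e per kWh) is an assumption chosen to be consistent with the numbers quoted above.

cable_modem_w, wifi_w, smart_tv_w = 5, 10, 100      # typical household equipment wattage
grid_intensity_g_per_kwh = 600                      # assumed average grid carbon intensity

kwh_per_viewing_hour = (cable_modem_w + wifi_w + smart_tv_w) / 1000.0   # 0.115 kWh
co2e_grams = kwh_per_viewing_hour * grid_intensity_g_per_kwh            # ~69 g CO2e per hour
print(round(co2e_grams), "g CO2e per viewing hour")                     # roughly the 70 g quoted above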
We encourage our CE partners to make energy-wise designs, but ultimately the choices that customers make are also governed by their other home entertainment and computing needs and desires, and accordingly we don’t measure or attempt to offset those impacts.

Comparisons

In 2014, Netflix infrastructure generated only 0.5g of CO2e emissions for each hour of streaming. The average human breathing emits about 40g/hour, nearly 100x as much.  Sitting still while watching Netflix probably saves more CO2 than Netflix burns.
The amount of carbon equivalent emitted in order to produce a single quarter-pound hamburger can power Netflix infrastructure to enable viewing by 10 member families for an entire year!
A viewer who turned off their TV to read books would get through about 24 books a year in the equivalent time, for a carbon footprint of around 65kg CO2e - over 200 times more than Netflix streaming servers - while the 100W reading light they might use would match the consumption of the TV they could have watched instead!

Localization Technologies at Netflix

The localization program at Netflix is centered around linguistic excellence, a great team environment, and cutting-edge technology. The program is only 4 years old, which is unusual for a company of our size. We’ve built a team and toolset representative of the scope and scale at which a localization team needs to operate in 2015, not one that is bogged down with years of legacy process and technology, as is often the case.
We haven’t been afraid to experiment with new localization models and tools, going against localization industry norms and achieving great things along the way. At Netflix we are given the freedom to trailblaze.
In this blog post we’re going to take a look at two major pieces of technology we’ve developed to assist us on our path to global domination…
Netflix Global String Repository
Having great content by itself is not enough to make Netflix successful; how the content is presented has a huge impact. An intuitive, easy to use, and localized user interface (UI) contributes significantly to Netflix's success. Netflix is available on the web and on a vast number of devices and platforms including Apple iOS, Google Android, Sony PlayStation, Microsoft Xbox, and TVs from Sony, Panasonic, and others. Each of these platforms has its own standards for internationalization, and that poses a challenge to our localization team.
Here are some situations that require localization of UI strings:
- New languages are introduced
- New features are developed
- Fixes are made to current text data
Traditionally, getting UI strings translated is a high-touch process where a localization PM partners with a dev team to understand where to get the source strings from, what languages to translate them into, and where to deliver the final localized files. This gets further complicated when multiple features are being developed in parallel using different branches in Git.
Once translations are completed and the final files delivered, an application typically goes through a build, test and deploy process. For device UIs, a build might need additional approval from a third party like Apple. This causes unnecessary delays, especially in cases where a fix to a string needs to be rolled out immediately.
What if we could make this whole process transparent to the various stakeholders – developers and localization? What if we could make builds unnecessary when fixes to text need to be delivered?
In order to answer those questions we have developed a global repository for UI strings, called Global String Repository, that allows teams to store their localized string data and pull it out at runtime. We have also integrated Global String Repository with our current localization pipeline making the whole process of localization seamless. All translations are available immediately for consumption by applications.
Global String Repository allows isolation through bundles and namespaces. A bundle is a container for string data across multiple languages. A namespace is a placeholder for bundles that are being worked upon. There is a default namespace that is used for publishing. A simple workflow would be:
  1. A developer makes a change to the English string data in a bundle in a namespace
  2. Translation workflows are automatically triggered
  3. Linguist completes the translation workflow
  4. Translations are made available to the bundle in the namespace
Applications have a choice when integrating with Global String Repository:
  • Runtime: Allows fast propagation of changes to UIs
  • Build time: Uses Global String Repository solely for localization but packages the data with the builds
Global String Repository allows build time integration by making all necessary localized data available through a simple REST API.
We expose the Global String Repository via the Netflix edge APIs, and it is subject to the same scaling and availability requirements as the other metadata APIs. It is a critical piece, especially for applications that integrate at runtime. With over 60 million customers, a large portion of whom stream Netflix on devices, Global String Repository is in the critical path.
True to the Netflix way, Global String Repository consists of a back-end microservice and a UI. The microservice is built as a Java web application using Apache Cassandra and Elasticsearch. It is deployed in AWS across 3 regions. We collect telemetry for every API interaction.
The Global String Repository UI is developed using Node.js, Bootstrap and Backbone and is also deployed in the AWS cloud.
On the client side, Global String Repository exposes REST APIs to retrieve string data and also offers a Java client with built-in caching.
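The client API itself isn’t documented in this post, but as a rough sketch of what the REST (build time) integration could look like, here is a minimal Java example that fetches a localized bundle over HTTP. The host, path, and query parameter are invented for illustration and are not the real Global String Repository endpoint:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class GsrFetchExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: bundle "signup" in the default namespace, Danish locale.
        URL url = new URL("https://gsr.example.com/v1/bundles/signup/strings?locale=da-DK");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            // The response would be a JSON map of string keys to localized values,
            // which a build-time integration would package with the application.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}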
While we have Global String Repository up and running, there is still a long way to go. Some of the things we are currently working on are:
- Enhancing support for quantity strings (plurals) and gender based strings
- Making the solution more resilient to failures
- Improving scalability
- Supporting multiple export formats (Android XML, Microsoft .Resx, etc)
The Global String Repository has no binding to Netflix's business domain, so we plan on releasing it as open source software.
Hydra
Netflix, as a soon-to-be global service, supports many locales across a myriad of device/UI combinations; testing this manually just does not scale. Previously, members of the localization and UI teams would manually use actual devices, from game consoles to iOS and Android, to see all of these strings in context and test both the content and any UI issues, such as truncations.
At Netflix, we think there is always a better way; with that attitude we rethought how we do in context, on device localization testing, and Hydra was born.
The motivation behind Hydra is to catalogue every possible unique screen and allow anyone to see a specific set of screens that they are interested in, across a wide range of filters including devices and locales. For example, as a German localization specialist you could, by selecting the appropriate filters, see the non-member flow in German across PS3, Website and Android. These screens can then be reviewed in a fraction of the time it would take to get to all of those different screens across those devices.
How Screens Reach Hydra
Hydra itself does not take any of the screens; it serves to catalogue and display them. To get screens into Hydra, we leverage our existing UI automation. Through Jenkins CI jobs, data-driven tests are run in parallel across all supported locales to take screenshots and post the screens to Hydra with appropriate metadata, including page name, feature area, major UI platform, and one critical piece of metadata: the unique screen definition.
The purpose of the unique screen definition is to have a full catalogue of screens without any unnecessary overlap. This reduces the number of screens that need to be reviewed and, longer term, makes it possible to compare a given screen against itself over time. The definition of a unique screen differs from UI to UI; for the browser it is a combination of page name, browser, resolution, locale, and dev environment.
The Technology
[Image: hydraPost.jpg]
Hydra is a full stack web application deployed to AWS. The Java-based backend has two main functions: it processes incoming screenshots and exposes data to the frontend through REST APIs. When the UI automation posts a screen to Hydra, the image file itself is written to S3, allowing for more or less infinite storage, and the much smaller metadata is written to an RDS database so it can be queried later through the REST APIs. The REST endpoints provide a mapping of query string params to MySQL queries.
For example:
REST/v1/lists/distinctList?item=feature&selectors=uigroup,TVUI;area,signupwizard;locale,da-DK
This call would essentially map to this query to populate the values for the ‘feature’ filter:
select distinct feature where uigroup = ‘TVUI’ AND area = ‘signupwizard’ AND locale = ‘da-DK’
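As a minimal sketch (not Hydra’s actual code) of how such an endpoint might translate the query-string selectors into a parameterized MySQL query via JDBC; the table name "screens" and the JDBC URL are assumptions based on the example above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class DistinctListExample {
    // Returns the distinct values of `item` (e.g. "feature") for the given selectors.
    static List<String> distinctList(String item, String uigroup, String area, String locale)
            throws Exception {
        // The column named by `item` would be whitelisted elsewhere; only the
        // selector values are bound as parameters.
        String sql = "SELECT DISTINCT " + item
                + " FROM screens WHERE uigroup = ? AND area = ? AND locale = ?";
        List<String> values = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://rds-host/hydra");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, uigroup);
            ps.setString(2, area);
            ps.setString(3, locale);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    values.add(rs.getString(1));
                }
            }
        }
        return values;
    }
}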
The JavaScript frontend, which leverages knockout.js, serves to allow users to select filters and view the screens that match those filters. The content of the filters as well as the screens that match the filters that are already selected are both provided by making calls to the rest endpoints mentioned above.
Allowing for Scale
With Hydra in place and the automation running, adding support for a new locale becomes as easy as adding one line to an existing property file that feeds the TestNG data provider. Screens in the new locale will then flow in with the next Jenkins builds that run.
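To make that concrete, here is an illustrative TestNG data provider driven by a property file; the file name and property key are assumptions, not Hydra’s actual configuration:

import java.io.FileInputStream;
import java.util.Properties;
import org.testng.annotations.DataProvider;

public class LocaleDataProvider {
    // Reads a comma-separated list such as "en-US,da-DK,de-DE" and hands
    // one locale per row to any test method that declares this provider.
    @DataProvider(name = "locales")
    public static Object[][] locales() throws Exception {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("locales.properties")) {
            props.load(in);
        }
        String[] locales = props.getProperty("supported.locales").split(",");
        Object[][] rows = new Object[locales.length][1];
        for (int i = 0; i < locales.length; i++) {
            rows[i][0] = locales[i].trim();
        }
        return rows;
    }
}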
Next Steps
One known improvement is to have a mechanism for knowing when a screen has changed. In its current state, if a string changes there is nothing that automatically identifies that a screen has changed. Hydra could evolve into more or less a work queue: localization experts could log in and see only the specific set of screens that have changed.
Another feature would be the ability to map individual string keys to the screens in which they appear. This would allow a translator to change a string, search for that string key, and see the screens affected by that change. The translator could then see the string change in context before even making it.
If what we’re doing here at Netflix with regards to localization technology excites you, please take a moment to review the open positions on our Localization Platform Engineering team:


We like big challenges and have no shortage of them to work on. We currently operate in 50 countries; by the end of 2016 that number will grow to 200. Netflix will be a truly global product, and our localization team needs to scale to support that. Challenges like these have allowed us to attract the best and brightest talent, and we’ve built a team that can do what seems impossible.

NTS: Real-time Streaming for Test Automation

by Peter Hausel and Jwalant Shah

Netflix Test Studio



Netflix members can enjoy instant access to TV shows & Movies on over 1400 different device/OS permutations. Assessing long-duration playback quality and delivering a great member experience on such a diverse set of playback devices presented a huge challenge to the team.


Netflix Test Studio (NTS) was created to give internal and external developers a consistent way to deploy and execute tests. This is achieved by abstracting device differences. NTS also provides a standard set of tools for assessing the responsiveness and quality of the overall experience. NTS now runs over 40,000 long-running tests each day on over 600 devices around the world.


Overview


NTS is a cloud-based automation framework that lets you remote control most Netflix Ready Devices. In this post we’ll focus on two key aspects of the framework:
  • Collect test results in near-realtime.
    • A highly event driven architecture allows us to accomplish this: JSON snippets sent from the single page UI to the device and JavaScript listeners on the device firing back events. We also have a requirement to be able to play back events as they happened, just like a state machine.
  • Allow testers to interact with both the device and various Netflix services during execution.
    • Integrated tests require the control of the test execution stream in order to simulate real-world conditions. We want to simulate failures, pause, debug and resume during test execution.


A typical user interface for Test Execution using NTS

A Typical NTS Test:



Architecture overview

The early implementation of NTS had a relatively simple design: a Netflix Ready Device was hijacked for automation via various redirection methods, and a Test Harness (test executor) coordinated the execution with the help of a central, public-facing Controller service. We would then get data out of the device via long polling, validate steps, and bubble validation results back up to the client. We built separate clusters of this architecture for each Netflix SDK version.

Original Architecture using Long Polling


Event playback is not supported

This model worked relatively well in the beginning. However, as the number of supported devices, SDKs, and test cases grew, we started seeing the limitations of this approach: messages were sometimes lost, there was no way of knowing what exactly had happened, error messages were misleading, and tests were hard to monitor and play back in real time. Finally, maintaining almost identical clusters with different test content and SDK versions introduced an additional maintenance burden as well.

In the next iteration of the tool, we removed the Controller service and most of the polling by introducing a WebSockets proxy (built on top of JSR-356) that was sitting between the clients and Test Executors. We also introduced JSON-RPC as the command protocol.

Updated Version - Near-Realtime (Almost There)


Pub/Sub without event playback support

  • The Test Executor submits events, in time-series fashion, to a WebSocket bus that terminates at the Dispatcher.
  • The client connects to a Dispatcher with session id information. There is a one-to-many relationship between a Dispatcher and Test Executors.
  • Each Dispatcher instance keeps an internal lookup from test execution session ids to the WebSocket connections to Test Executors, and delivers messages received over those connections to the client (a minimal endpoint sketch follows this list).
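The post doesn’t include the proxy code, but as a rough, illustrative sketch of a JSR-356 style dispatcher endpoint (the path, class names, and routing are assumptions, not NTS’s actual implementation):

import javax.websocket.OnMessage;
import javax.websocket.OnOpen;
import javax.websocket.Session;
import javax.websocket.server.PathParam;
import javax.websocket.server.ServerEndpoint;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical dispatcher-style endpoint: clients subscribe to a test execution
// session id, and events published for that session are forwarded to them.
@ServerEndpoint("/dispatch/{executionSessionId}")
public class DispatcherEndpoint {

    // Maps a test-execution session id to the client WebSocket session watching it.
    private static final Map<String, Session> CLIENTS = new ConcurrentHashMap<>();

    @OnOpen
    public void onOpen(Session session, @PathParam("executionSessionId") String sessionId) {
        CLIENTS.put(sessionId, session);
    }

    // Called when a Test Executor publishes an event for a given execution session.
    public static void forward(String sessionId, String jsonRpcEvent) throws IOException {
        Session client = CLIENTS.get(sessionId);
        if (client != null && client.isOpen()) {
            client.getBasicRemote().sendText(jsonRpcEvent);
        }
    }

    @OnMessage
    public void onMessage(String message, Session session) {
        // In NTS, incoming messages would be JSON-RPC commands routed toward
        // the Test Executor; omitted in this sketch.
    }
}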

This approach solved most of our issues: fewer indirections, real-time streaming capabilities, push-based design. There were only two remaining issues: message durability was still not supported and more importantly, the WebSockets proxy was difficult to scale out due to its stateful nature.

At this point, we started looking into Apache Kafka to replace the internal WebSocket layer with a distributed pub/sub and message queue solution.

Current version - Kafka
Pub/Sub with event playback support

A few interesting properties of this pub/sub system:
  • Dispatcher is responsible for handling client requests to subscribe to Test Execution events stream.
  • Kafka provides a scalable message queue between the Test Executor and the Dispatcher. Since each session id is mapped to a particular partition and each message sent to the client includes the current Kafka offset, we can now guarantee reliable delivery of messages to clients, with support for replaying messages after a network reconnection (see the sketch after this list).
  • Multiple clients can subscribe to the same stream without additional overhead, and admin users can view/monitor remote users’ test executions in real time.
  • The same stream is consumed for analytics purposes as well.
  • Throughput/Latency: during load testing, we could get ~90-100ms latency per message consistently with 100 concurrent users (our test setup was 6 brokers deployed on 6 d2.xlarge instances). In our production system, latency is often lower due to batching.
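To make the replay behavior concrete, here is a rough sketch of how a dispatcher could resume a client’s stream from its last acknowledged offset after a reconnect. It uses the modern kafka-clients consumer API rather than whatever client NTS used at the time, and the topic name, partition, and offset are assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Assumption: each execution session id maps to one partition of a
        // "test-execution-events" topic, and the client reported the last
        // offset it received before disconnecting.
        TopicPartition partition = new TopicPartition("test-execution-events", 3);
        long lastAcknowledgedOffset = 1234L;

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, lastAcknowledgedOffset + 1);
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                // Each event carries its offset, so the client can acknowledge
                // progress and request replay again after the next reconnect.
                System.out.println(record.offset() + " " + record.value());
            }
        }
    }
}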

Where do we go from here?

With HTTP/2 on the horizon, it’s unclear where WebSockets will fit in the long run. That said, if you need a TCP-based, persistent channel now, you don’t have a better option. While we are actively migrating away from JSR-356 (and Tomcat WebSocket) to RxNetty due to numerous issues we ran into, we continue to invest more in WebSockets.

As for Kafka, the transition was not problem free either. But Kafka solved some very hard problems for us (distributed event bus, message durability, consuming a stream both as a distributed queue and pub/sub etc.) and more importantly, it opened up the door for further decoupling. As a result, we are moving forward with our strategic plan to use this technology as the unified backend for our data pipeline needs.

(Engineers who worked on this project: Jwalant Shah, Joshua Hua, Matt Sun)

Tracking down the Villains: Outlier Detection at Netflix

It’s 2 a.m. and half of our reliability team is online searching for the root cause of why Netflix streaming isn’t working. None of our systems are obviously broken, but something is amiss and we’re not seeing it. After an hour of searching we realize there is one rogue server in our farm causing the problem. We missed it amongst the thousands of other servers because we were looking for a clearly visible problem, not an insidious deviant.

In Netflix’s Marvel’s Daredevil, Matt Murdock uses his heightened senses to detect when a person’s actions are abnormal. This allows him to go beyond what others see to determine the non-obvious, like when someone is lying. Similar to this, we set out to build a system that could look beyond the obvious and find the subtle differences in servers that could be causing production problems. In this post we’ll describe our automated outlier detection and remediation for unhealthy servers that has saved us from countless hours of late-night heroics.

Shadows in the Glass

The Netflix service currently runs on tens of thousands of servers; typically less than one percent of those become unhealthy. For example, a server’s network performance might degrade and cause elevated request processing latency. The unhealthy server will respond to health checks and show normal system-level metrics but still be operating in a suboptimal state.

A slow or unhealthy server is worse than a down server because its effects can be small enough to stay within the tolerances of our monitoring system and be overlooked by an on-call engineer scanning through graphs, but still have a customer impact and drive calls to customer service. Somewhere out there a few unhealthy servers lurk among thousands of healthy ones.

[Figure: NIWSErrors - hard to see the outlier (can you spot it?)]
The purple line in the graph above has an error rate higher than the norm. All other servers have spikes but drop back down to zero, whereas the purple line consistently stays above all others. Would you be able to spot this as an outlier? Is there a way to use time series data to automatically find these outliers?

A very unhealthy server can easily be detected by a threshold alert. But threshold alerts require wide tolerances to account for spikes in the data. They also require periodic tuning to account for changes in access patterns and volume. A key step towards our goal of improving reliability is to automate the detection of servers that are operating in a degraded state but not bad enough to be detected by a threshold alert.
[Figure: an outlier just above the noise]

Finding a Rabbit in a Snowstorm

To solve this problem we use cluster analysis, which is an unsupervised machine learning technique. The goal of cluster analysis is to group objects in such a way that objects in the same cluster are more similar to each other than those in other clusters. The advantage of using an unsupervised technique is that we do not need to have labeled data, i.e., we do not need to create a training dataset that contains examples of outliers. While there are many different clustering algorithms, each with their own tradeoffs, we use Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to determine which servers are not performing like the others.

How DBSCAN Works

DBSCAN is a clustering algorithm originally proposed in 1996 by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu. This technique iterates over a set of points and marks as clusters points that are in regions with many nearby neighbors, while marking those in lower density regions as outliers. Conceptually, if a particular point belongs to a cluster it should be near lots of other points as measured by some distance function. For an excellent visual representation of this see Naftali Harris’ blog post on visualizing DBSCAN clustering.

How We Use DBSCAN

To use server outlier detection, a service owner specifies a metric which will be monitored for outliers. Using this metric we collect a window of data from Atlas, our primary time series telemetry platform. This window is then passed to the DBSCAN algorithm, which returns the set of servers considered outliers. For example, the image below shows the input into the DBSCAN algorithm; the red highlighted area is the current window of data:
In addition to specifying the metric to observe, a service owner specifies the minimum duration before a deviating server is considered an outlier. After detection, control is handed off to our alerting system that can take any number of actions including:

  • email or page a service owner
  • remove the server from service without terminating it
  • gather forensic data for investigation
  • terminate the server to allow the auto scaling group to replace it
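The post doesn’t say which DBSCAN implementation is used, so as a rough illustration of the core step only, here is a sketch using Apache Commons Math’s DBSCANClusterer: each server’s metric window is a point, and anything left outside every dense cluster is treated as an outlier. The eps and minClusterSize values are placeholders that would come from the parameter selection described next:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.commons.math3.ml.clustering.Cluster;
import org.apache.commons.math3.ml.clustering.DBSCANClusterer;
import org.apache.commons.math3.ml.clustering.DoublePoint;

public class ServerOutlierExample {
    // Each row of `windows` is one server's metric window (e.g. error rate samples).
    // Returns the indices of servers that DBSCAN leaves outside every cluster.
    static List<Integer> findOutliers(double[][] windows, double eps, int minClusterSize) {
        List<DoublePoint> points = new ArrayList<>();
        for (double[] window : windows) {
            points.add(new DoublePoint(window));
        }
        DBSCANClusterer<DoublePoint> clusterer = new DBSCANClusterer<>(eps, minClusterSize);
        Set<DoublePoint> clustered = new HashSet<>();
        for (Cluster<DoublePoint> cluster : clusterer.cluster(points)) {
            clustered.addAll(cluster.getPoints());
        }
        // DBSCAN marks low-density points as noise; those are the outlier servers.
        List<Integer> outliers = new ArrayList<>();
        for (int i = 0; i < points.size(); i++) {
            if (!clustered.contains(points.get(i))) {
                outliers.add(i);
            }
        }
        return outliers;
    }
}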

Parameter Selection

DBSCAN requires two input parameters for configuration: a distance measure and a minimum cluster size. However, service owners do not want to think about finding the right combination of parameters to make the algorithm effective at identifying outliers. We simplify this by having service owners define the current number of outliers, if there are any, at configuration time. Based on this knowledge, the distance and minimum cluster size parameters are selected using simulated annealing. This approach has been effective in reducing the complexity of setting up outlier detection and has facilitated adoption across multiple teams; service owners do not need to concern themselves with the details of the algorithm.

Into the Ring

To assess the effectiveness of our technique we evaluated results from a production service with outlier detection enabled. Using one week’s worth of data, we manually determined if a server should have been classified as an outlier and remediated. We then cross-referenced these servers with the results from our outlier detection system. From this, we were able to calculate a set of evaluation metrics including precision, recall, and f-score:

Server Count    Precision    Recall    F-score
1960            93%          87%       90%

These results illustrate that we cannot perfectly distill outliers in our environment but we can get close. An imperfect solution is entirely acceptable in our cloud environment because the cost of an individual mistake is relatively low. Erroneously terminating a server or pulling one out of service has little to no impact because it will be immediately replaced with a fresh server.  When using statistical solutions for auto remediation we must be comfortable knowing that the system will not be entirely accurate; an imperfect solution is preferable to no solution at all.

The Ones We Leave Behind

Our current implementation is based on a mini-batch approach where we collect a window of data and use this to make a decision. Compared to a real-time approach, this has the drawback that outlier detection time is tightly coupled to window size: too small and you’re subject to noise, too big and your detection time suffers. Improved approaches could leverage advancements in real-time stream processing frameworks such as Mantis (Netflix's Event Stream Processing System) and Apache Spark Streaming. Furthermore, significant work has been conducted in the areas of data stream mining and online machine learning. We encourage anyone looking to implement such a system to consider using online techniques to minimize time to detect.

Parameter selection could be further improved with two additional services: a data tagger for compiling training datasets and a model server capable of scoring the performance of a model and retraining the model based on an appropriate dataset from the tagger. We’re currently tackling these problems to allow service owners to bootstrap their outlier detection by tagging data (a domain in which they are intimately familiar) and then computing the DBSCAN parameters (a domain that is likely foreign) using a Bayesian parameter selection technique to optimize the score of the parameters against the training dataset.

World on Fire

As Netflix’s cloud infrastructure increases in scale, automating operational decisions enables us to improve availability and reduce human intervention. Just as Daredevil uses his suit to amplify his fighting abilities, we can use machine learning and automated responses to enhance the effectiveness of our site reliability engineers and on-call developers.  Server outlier detection is one example of such automation, other examples include Scryer and Hystrix. We are exploring additional areas to automate such as:

  • Analysis and tuning of service thresholds and timeouts
  • Automated canary analysis
  • Shifting traffic in response to region-wide outages
  • Automated performance tests that tune our autoscaling rules

These are just a few examples of steps towards building self-healing systems of immense scale. If you would like to join us in tackling these kinds of challenges, we are hiring!

Java in Flames

Java mixed-mode flame graphs provide a complete visualization of CPU usage and have just been made possible by a new JDK option: -XX:+PreserveFramePointer. We've been developing these at Netflix for everyday Java performance analysis as they can identify all CPU consumers and issues, including those that are hidden from other profilers.

Example
This shows CPU consumption by a Java process, both user- and kernel-level, during a vert.x benchmark:

Showing all CPU usage with Java context is amazing and useful. On the top right you can see a peak of kernel code (colored red) for performing a TCP send (which often leads to a TCP receive while handling the send). Beneath it (colored green) is the Java code responsible. In the middle (colored green) is the Java code that is running on-CPU. And in the bottom left, a small yellow tower shows CPU time spent in GC.

We've already used Java flame graphs to quantify performance improvements between frameworks (Tomcat vs rxNetty), which included identifying time spent in Java code compilation, the Java code cache, other system libraries, and differences in kernel code execution. All of these CPU consumers were invisible to other Java profilers, which only focus on the execution of Java methods.

Flame Graph Interpretation
If you are new to flame graphs: The y axis is stack depth, and the x axis spans the sample population. Each rectangle is a stack frame (a function), where the width shows how often it was present in the profile. The ordering from left to right is unimportant (the stacks are sorted alphabetically).

In the previous example, color hue was used to highlight different code types: green for Java, yellow for C++, and red for system. Color intensity was simply randomized to differentiate frames (other color schemes are possible).

You can read the flame graph from the bottom up, which follows the flow of code from parent to child functions. Another way is top down, as the top edge shows the function running on CPU, and beneath it is its ancestry. Focus on the widest functions, which were present in the profile the most. See the CPU flame graphs page for more about interpretation, and Brendan's USENIX/LISA'13 talk (video).

The Problem with Profilers
In order to generate flame graphs, you need a profiler that can sample stack traces. There have historically been two types of profilers used on Java:
  • System profilers: such as Linux perf_events, which can profile system code paths, including libjvm internals, GC, and the kernel, but not Java methods.
  • JVM profilers: such as hprof, Lightweight Java Profiler (LJP), and commercial profilers. These show Java methods, but not system code paths.
To understand all types of CPU consumers, we previously used both types of profilers, creating a flame graph for each. This worked – sort of. While all CPU consumers could be seen, Java methods were missing from the system profile, which was crucial context we needed.

Ideally, we would have one flame graph that shows it all: system and Java code together.

A system profiler like Linux perf_events should be well suited to this task as it can interrupt any software asynchronously and capture both user- and kernel-level stacks. However, system profilers don't work well with Java. The problem is shown by the flame graph on the right. The Java stacks and method names are missing.

There were two specific problems to solve:
  1. The JVM compiles methods on the fly (just-in-time: JIT), and doesn't expose a symbol table for system profilers.
  2. The JVM also uses the frame pointer register on x86 (RBP on x86-64) as a general-purpose register, breaking traditional stack walking.
Brendan summarized these earlier this year in his Linux Profiling at Netflix talk for SCALE. Fortunately, there was already a fix for the first problem.

Fixing Symbols
In 2009, Linux perf_events added JIT symbol support, so that symbols from language virtual machines like the JVM could be inspected. To use it, your application creates a /tmp/perf-PID.map text file, which lists symbol addresses (in hex), sizes, and symbol names. perf_events looks for this file by default and, if found, uses it for symbol translations.
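Each line of that file is simply a hexadecimal start address, a hexadecimal size, and a symbol name; for example (addresses and symbols made up for illustration):

7f9a1c023a80 120 Ljava/lang/String;::hashCode
7f9a1c024200 340 Lcom/example/Foo;::handleRequest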

Java can create this file using perf-map-agent, an open source JVMTI agent written by Johannes Rudolph. The first version needed to be attached on Java startup, but Johannes enhanced it to attach later on demand and take a symbol dump. That way, we only load it if we need it for a profile. Thanks, Johannes!

Since symbols can change slightly during the profile (we’re typically profiling for 30 or 60 seconds), a symbol dump may include stale symbols. We’ve looked at taking two symbol dumps, before and after the profile, to highlight any such differences. Another approach in development involves a timestamped symbol log to ensure that all translations are accurate (although this requires always-on logging of symbols). So far symbol churn hasn’t been a large problem for us once Java and the JIT have “warmed up”, which can take a few minutes given sufficient load; after that, churn is minimal. We do bear it in mind when interpreting flame graphs.

Fixing Frame Pointers
For many years the gcc compiler has reused the frame pointer as a compiler optimization, breaking stack traces. Some applications are compiled with the gcc option -fno-omit-frame-pointer to preserve this type of stack walking; however, the JVM had no equivalent option. Could the JVM be modified to support this?

Brendan was curious to find out, and hacked a working prototype for OpenJDK. It involved dropping RBP from eligible register pools, eg (diff):
--- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad      2014-03-04 02:52:11.000000000 +0000
+++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 01:10:49.686044933 +0000
@@ -166,10 +166,9 @@
// 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ )
//

-// Class for all pointer registers (including RSP)
+// Class for all pointer registers (including RSP, excluding RBP)
reg_class any_reg(RAX, RAX_H,
RDX, RDX_H,
- RBP, RBP_H,
RDI, RDI_H,
RSI, RSI_H,
RCX, RCX_H,
... and then fixing the function prologues to store the stack pointer (rsp) into the frame pointer (base pointer) register (rbp):
--- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04 02:52:11.000000000 +0000
+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 23:57:11.589593723 +0000
@@ -5236,6 +5236,7 @@
// We always push rbp, so that on return to interpreter rbp, will be
// restored correctly and we can correct the stack.
push(rbp);
+ mov(rbp, rsp);
// Remove word for ebp
framesize -= wordSize;
It worked. Here are the before and after flame graphs. Brendan posted it, with example flame graphs, to the hotspot compiler devs mailing list. This feature request became JDK-8068945 for JDK9 and JDK-8072465 for JDK8.

Fixing this properly involved a lot more work (see discussions in the bugs and mailing list). Zoltán Majó, of Oracle, took this on and rewrote the patch. After testing, it was finally integrated into the early access releases of both JDK9 and JDK8 (JDK8 update 60 build 19), as the new JDK option: -XX:+PreserveFramePointer.

Many thanks to Zoltán, Oracle, and the other engineers who helped get this done!

Since use of this mode disables a compiler optimization, it does decrease performance slightly. We've found in tests that this costs between 0 and 3% extra CPU, depending on the workload. See JDK-8068945 for some additional benchmarking details. There are also other techniques for walking stacks, some of which can be made available at zero run-time cost; however, those approaches have other downsides.

Instructions
The following steps describe how these flame graphs can be created. We’re working on improving and automating these steps using Vector (more on that in a moment).

1. Install software
There are four components to install:

Linux perf_events
This is the standard Linux profiler, aka “perf” after its front end, and is included in the Linux source (tools/perf). Try running perf help to see if it is installed; if not, your distro may suggest how to get it, usually by adding a perf-tools-common package.

Java 8 update 60 build 19 (or newer)
This includes the frame pointer patch fix (JDK-8072465), which is necessary for Java stack profiling. It is currently released as early access (built from OpenJDK).

perf-map-agent
This is a JVMTI agent that provides Java symbol translation for perf_events; it is on GitHub. Steps to build it typically involve:
apt-get install cmake
export JAVA_HOME=/path-to-your-new-jdk8
git clone --depth=1 https://github.com/jrudolph/perf-map-agent
cd perf-map-agent
cmake .
make
The current version of perf-map-agent can be loaded on demand, after Java is running.
WARNING: perf-map-agent is experimental code – use at your own risk, and test before use!

FlameGraph
This is some Perl software for generating flame graphs. It can be fetched from github:
git clone --depth=1 https://github.com/brendangregg/FlameGraph
This contains stackcollapse-perf.pl, for processing perf_events profiles, and flamegraph.pl, for generating the SVG flame graph.

2. Configure Java
Java needs to be running with the -XX:+PreserveFramePointer option, so that perf_events can perform frame pointer stack walks. As mentioned earlier, this can cost some performance, between 0 and 3% depending on the workload.
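For example, assuming a simple jar-based launch (your actual startup command and other options will differ):
java -XX:+PreserveFramePointer -jar yourapp.jar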

3a. Generate System Wide Flame Graphs
With this software and Java running with frame pointers, we can profile and generate flame graphs.

For example, taking a 30-second profile at 99 Hertz (samples per second) of all processes, then caching symbols for Java PID 1690, then generating a flame graph:
sudo perf record -F 99 -a -g -- sleep 30
java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar net.virtualvoid.perf.AttachOnce 1690 # run as same user as java
sudo chown root /tmp/perf-*.map
sudo perf script | stackcollapse-perf.pl | \
flamegraph.pl --color=java --hash > flamegraph.svg
The attach-main.jar file is from perf-map-agent, and stackcollapse-perf.pl and flamegraph.pl are from FlameGraph. Specify their full paths unless they are in the current directory.

These steps address some quirky behavior involving user permissions: sudo perf script only reads symbol files the current user (root) owns, and, perf-map-agent creates files with the same user ownership as the Java process, which for us is usually non-root. This means we have to change the ownership to root for the symbol file, and then run perf script.

With jmaps
Dealing with symbol files has become a chore, so we’ve been automating it. Here’s one example: jmaps, which can be used like so:
sudo perf record -F 99 -a -g -- sleep 30; sudo jmaps
sudo perf script | stackcollapse-perf.pl | \
flamegraph.pl --color=java --hash > flamegraph.svg
jmaps creates symbol files for all Java processes, with root ownership. You may want to write a similar “jmaps” helper for your environment (our jmaps example is unsupported). Remember to clean up the /tmp symbol files when you no longer need them!

3b. Generate By-Process Flame Graphs
The previous procedure grouped Java processes together. If it is important to separate them (and, on some of our instances, it is), you can modify the procedure to generate a by-process flame graph. Eg (with jmaps):
sudo perf record -F 99 -a -g -- sleep 30; sudo jmaps
sudo perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace | \
stackcollapse-perf.pl --pid | \
flamegraph.pl --color=java --hash > flamegraph.svg
The output of stackcollapse-perf.pl formats each stack as a single line, and is great food for grep/sed/awk. For the flamegraph at the top of this post, we used the above procedure, and added “| grep java-339” before the “| flamegraph.pl”, to isolate that one process. You could also use a “| grep -v cpu_idle”, to exclude the kernel idle threads.

Missing Frames
If you start using these flame graphs, you’ll notice that many Java frames (methods) are missing. Compared to the jstack(1) command line tool, the stacks seen in the flame graph look perhaps one third as deep, and are missing many frames. This is because of inlining, combined with this type of profiling (frame pointer based) which only captures the final executed code.

This hasn’t been much of a problem so far: even when many frames are missing, enough remain that we can figure out what’s going on. We’ve also experimented with reducing the amount of inlining, eg, using -XX:InlineSmallCode=500, to increase the number of frames in the profile. In some cases this even improves performance slightly, as the final compiled instruction size is reduced, fitting better into the processor caches (we confirmed this using perf_events separately).

Another approach is to use JVMTI information to unfold the inlined symbols. perf-map-agent has a mode to do this; however, Min Zhou from LinkedIn has experienced Java crashes when using this, which he has been fixing in his version. We’ve not seen these crashes (as we rarely use that mode), but be warned.

Vector
The previous steps for generating flame graphs are a little tedious. As we expect these flame graphs will become an everyday tool for Java developers, we’ve looked at making them as easy as possible: a point-and-click interface. We’ve been prototyping this with our open source instance analysis tool: Vector.

Vector was described in more detail in a previous techblog post. It provides a simple way for users to visualize and analyze system and application-level metrics in near real-time, and flame graphs are a great addition to the set of functionalities it already provides.

We tried to keep the user interaction as simple as possible. To generate a flame graph, you connect Vector to the target instance, add the flame graph widget to the dashboard, then click the generate button. That's it!

Behind the scenes, Vector requests a flame graph from a custom instance agent that we developed, which also supplies Vector's other metrics. Vector checks the status of this request while fetching and displaying other metrics, and displays the flame graph when it is ready.

Our custom agent is not generic enough to be used by everyone yet (it depends on the Netflix environment), so we have yet to open-source it. If you're interested in testing or extending it, reach out to us.

Future Work
We have some enhancements planned. One is for regression analysis, by automatically collecting flame graphs over different days and generating flame graph differentials for them. This will help us quickly understand changes in CPU usage due to software changes.

Apart from CPU profiling, perf_events can also trace user- and kernel-level events, including disk I/O, networking, scheduling, and memory allocation. When these are synchronously triggered by Java, a mixed-mode flame graph will show the code paths that led to these events. A page fault mixed-mode flame graph, for example, can be used to show which Java code paths led to an increase in main memory usage (RSS).

We also want to develop enhancements for flame graphs and Vector, including real time updates. For this to work, our agent will collect perf_events directly and return a data structure representing the partial flame graph to Vector with every check. Vector, with this information, will be able to assemble the flame graph in real time, while the profile is still being collected. We are also investigating using D3 for flame graphs, and adding interactivity improvements.

Other Work
Twitter has also explored making perf_events and Java work better together, which Kaushik Srenevasan summarized in his Tracing and Profiling talk from OSCON 2014 (slides). Kaushik showed that perf_events has much lower overhead than some other Java profilers, and included a mixed-mode stack trace from perf_events. David Keenan from Twitter also described this work in his Twitter-Scale Computing talk (video), as well as summarizing other performance enhancements they have been making to the JVM.

At Google, Stephane Eranian has been working on perf_events and Java as well and has posted a patch series that supports a timestamped JIT symbol transaction log from Java for accurate symbol translation, solving the stale symbol problem. It’s impressive work, although a downside with the logging technique may be the performance cost of always logging symbols even if a profiler is never used.

Conclusion
CPU mixed-mode flame graphs help identify and quantify all CPU consumers. They show the CPU time spent in Java methods, system libraries, and the kernel, all in one visualization. This reveals CPU consumers that are invisible to other profilers, and these graphs have so far been used to identify issues and explain performance changes between software versions.

These mixed-mode flame graphs have been made possible by a new option in the JVM: -XX:+PreserveFramePointer, available in early access releases. In this post we described how these work, the challenges that were addressed, and provided instructions for their generation. Similar visibility for Node.js was described in our earlier post: Node.js in Flames.

by Brendan Gregg and Martin Spier

Tuning Tomcat For A High Throughput, Fail Fast System

Problem

Netflix has a number of high throughput, low latency mid-tier services. In one of these services, it was observed that if there was a huge surge in traffic in a very short span of time, the machines became cpu starved and unresponsive. This led to a bad experience for the clients of the service: they would get a mix of read and connect timeouts. Read timeouts can be particularly bad if they are set very high, because the client machines will wait to hear from the server for a long time. In an SOA, this can have a ripple effect, as the clients of those clients will also start getting read timeouts and all services can slow down. Under normal circumstances, the machines had an ample amount of cpu free and the service was not cpu intensive. So, why does this happen? In order to understand that, let's first look at the high level stack for this service. The request flow would look like this:




On simulating the traffic surge in the test environment, it was found that the reason for cpu starvation was improper Apache and Tomcat configuration. On a sudden increase in traffic, multiple Apache workers became busy and a very large number of Tomcat threads also got busy. There was a huge jump in system cpu, as none of the threads could do any meaningful work: most of the time, the cpu was context switching.

Solution

Since this was a mid-tier service, there was not much use for Apache. So, instead of tuning two systems (Apache and Tomcat), it was decided to simplify the stack and get rid of Apache. To understand why too many Tomcat threads got busy, let's look at the Tomcat threading model.

High Level Threading Model for Tomcat Http Connector

Tomcat has an acceptor thread to accept connections. In addition, there is a pool of worker threads which do the real work. The high level flow for an incoming request is:
  1. TCP handshake between OS and client for establishing a connection. Depending on the OS implementation there can be a single queue for holding the connections or there can be multiple queues. In case of multiple queues, one holds incomplete connections which have not yet completed the tcp handshake. Once completed, connections are moved to the completed connection queue for consumption by the application. "acceptCount" parameter in tomcat configuration is used to control the size of these queues.
  2. Tomcat acceptor thread accepts connections from the completed connection queue.
  3. Checks if a worker thread is available in the free thread pool. If not, creates a worker thread if the number of active threads < maxThreads. Else wait for a worker thread to become free.
  4. Once a free worker thread is found, acceptor thread hands the connection to it and gets back to listening for new connections.
  5. Worker thread does the actual job of reading input from the connection, processing the request and sending the response to the client. If the connection was not keep alive then it closes the connection and places itself in the free thread pool. For a keep alive connection, waits for more data to be available on the connection. In case data does not become available until keepAliveTimeout, closes the connection and makes itself available in the free thread pool.
In case the number of Tomcat threads and acceptCount values are set too high, a sudden increase in traffic will fill up the OS queues and make all the worker threads busy. When more requests are sent to the machines than the system can handle, this "queuing" of requests is inevitable and will lead to more busy threads, eventually causing cpu starvation.  Hence, the crux of the solution is to avoid too much queuing of requests at multiple points (the OS and Tomcat threads) and to fail fast (return HTTP status 503) as soon as the application's maximum capacity is reached. Here is a recommendation for doing this in practice:

Fail fast in case the system capacity for a machine is hit

Estimate the number of threads expected to be busy at peak load. If the server responds in 5 ms on average for a request, then a single thread can do a maximum of 200 requests per second (rps). If the machine has a quad-core cpu, it can do at most 800 rps. Now assume that 4 requests (since the machine is a quad core) arrive in parallel and hit the machine. This makes 4 worker threads busy for the next 5 ms. Since the total rps to the system maxes out at 800, in the next 5 ms another 4 requests will arrive and occupy 4 more threads; subsequent requests will pick up one of the previously busy threads that has become free. So, on average, there should not be more than 8 threads busy at 800 rps. The behavior will be a little different in practice because all system resources, like cpu, are shared. Hence one should experimentally determine the total throughput the system can sustain and calculate the expected number of busy threads. This provides a baseline for the number of threads needed to sustain peak load. To provide some buffer, let's more than triple that and set maxThreads to 30. This buffer is arbitrary and can be further tuned if needed; in our experiments, slightly more than a 3x buffer worked well.

Track the number of active concurrent requests in memory and use it for fast failing. If the number of concurrent requests is near the estimated number of active threads (8 in our example), return an HTTP status code of 503. This prevents too many worker threads from becoming busy, because once peak throughput is hit, any extra threads that become active do the very lightweight job of returning 503 and are then available for further processing.
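A minimal sketch of that fast-fail gate as a servlet filter; the threshold of 8 follows the example above, and this is illustrative rather than the exact code used in the service:

import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class FastFailFilter implements Filter {
    // Estimated number of threads expected to be busy at peak load (8 in the example above).
    private static final int MAX_CONCURRENT_REQUESTS = 8;
    private final AtomicInteger active = new AtomicInteger();

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (active.incrementAndGet() > MAX_CONCURRENT_REQUESTS) {
            active.decrementAndGet();
            // Returning 503 is cheap, so the extra worker thread frees up almost immediately.
            ((HttpServletResponse) response).sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        try {
            chain.doFilter(request, response);
        } finally {
            active.decrementAndGet();
        }
    }

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void destroy() { }
}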

Configure Operating System parameters

The acceptCount parameter for Tomcat dictates the length of the queues at the OS level for completing TCP handshake operations (details are OS specific). It's important to tune this parameter; otherwise one can have issues establishing connections to the machine, or connections can queue up excessively at the OS level, which leads to read timeouts. The implementation details of handling incomplete and complete connections vary across OSes. There can be a single queue of connections, or separate queues for incomplete and complete connections (please refer to the References section for details). So, a nice way to tune the acceptCount parameter is to start with a small value and keep increasing it until the connection errors are gone.

Having too large a value for acceptCount means that incoming requests can be accepted at the OS level. However, if the incoming rps is more than a machine can handle, all the worker threads will eventually become busy, and the acceptor thread will then wait for a worker thread to become free. More requests will continue to pile up in the OS queues, since the acceptor thread consumes them only when a worker thread becomes available. In the worst case, these requests will time out while waiting in the OS queues but will still be processed by the server once they get picked up by Tomcat's acceptor thread. This is a complete waste of processing resources, as the client will never receive the response.

If the value of acceptCount is too small, then at a high rps there will not be enough room for the OS to accept connections and make them available to the acceptor thread. In this case, connect timeout errors will be returned to the client well before the server's actual throughput limit is reached.

Hence, experiment by starting with a small value like 10 for acceptCount and keep increasing it until there are no connection errors from the server.
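Putting the two settings together, the relevant Connector element in Tomcat's server.xml might look like the following; the values are just the illustrative numbers from this post, not a recommendation:

<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="30"
           acceptCount="10"
           connectionTimeout="20000" />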

On doing both the changes above, even if all the worker threads become busy in the worst case, the servers will not be cpu starved and will be able to do as much work as possible (max throughput).

Other considerations

As explained above, each incoming connection is ultimately handed to a Tomcat worker thread. If HTTP keep-alive is turned on, a worker thread will continue to listen on a connection and will not be available in the free thread pool. So, if the clients are not smart enough to close connections that are no longer actively used, the server can very easily run out of worker threads. If keep-alive is turned on, then one has to size the server farm with this constraint in mind.

Alternatively, if keep alive is turned off then one does not have to worry about the problem of inactive connections using worker threads. However, in this case on each call one has to pay the price of opening and closing the connection. Further, this will also create a lot of sockets in the TIME_WAIT state which can put pressure on the servers.

It's best to make this choice based on the use cases for the application and to test the performance by running experiments.

Results

Multiple experiments were run with different configurations. The results are shown below. The dark blue line is the original configuration with Apache and Tomcat; all the others are different configurations of the Tomcat-only stack.


Throughput
Note the drop after a sustained period of traffic higher than what can be served by the server.


Busy Apache Workers


Idle cpu
Note that the original configuration got so busy that it was not even able to publish the stats for idle cpu on a continuous basis. The stats were published (with value 0) for the base configuration only intermittently, as highlighted in the red circles.


Server average latency to process a request

Note

It's possible to achieve the same results by tuning the combination of Apache and Tomcat to work together. However, since there was not much use for Apache in our service, we found the above model simpler, with one less moving part. It's best to make choices through a combination of understanding the system and experimentation and testing in a real-world environment to verify hypotheses.

References

  1. https://books.google.com/books/about/UNIX_Network_Programming.html?id=ptSC4LpwGA0C&source=kp_cover&hl=en
  2. http://www.sean.de/Solaris/soltune.html
  3. https://tomcat.apache.org/tomcat-7.0-doc/config/http.html
  4. http://grepcode.com/project/repository.springsource.com/org.apache.coyote/com.springsource.org.apache.coyote/

Acknowledgment

I would like to thank Mohan Doraiswamy for his suggestions in this effort.

Netflix at Velocity 2015: Linux Performance Tools

There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? At Velocity 2015, I gave a 90 minute tutorial on Linux performance tools. I’ve spoken on this topic before, but given a 90 minute time slot I was able to include more methodologies, tools, and live demonstrations, making it the most complete tour of the topic I’ve done. The video and slides are below.

In this tutorial I summarize traditional and advanced performance tools, including: top, ps, vmstat, iostat, mpstat, free, strace, tcpdump, netstat, nicstat, pidstat, swapon, lsof, sar, ss, iptraf, iotop, slabtop, pcstat, tiptop, rdmsr, lmbench, fio, pchar, perf_events, ftrace, SystemTap, ktap, sysdig, and eBPF; and reference many more. I also include updated tools diagrams for observability, sar, benchmarking, and tuning (including the image above).

This tutorial can be shared with a wide audience – anyone working on Linux systems – as a free crash course on Linux performance tools. I hope people enjoy it and find it useful. Here's the playlist.

Part 1 (youtube) (54 mins):

Part 2 (youtube) (45 mins):

Slides (slideshare):

At Netflix, we have Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis. Much of the time we don't need to login to instances directly, but when we do, this tutorial covers the tools we use.

Thanks to O'Reilly for hosting a great conference, and those who attended.

If you are passionate about the content in this tutorial, we're hiring, particularly for senior SREs and performance engineers: see Netflix jobs!

Making Netflix.com Faster

by Kristofer Baxter

Simply put, performance matters. We know members want to immediately start browsing or watching their favorite content and have found that faster startup leads to more satisfying usage. So, when building the long-awaited update to netflix.com, the Website UI Engineering team made startup performance a first tier priority.

This effort netted a 70% reduction in startup time and was focused on three key areas:

  1. Server and Client Rendering
  2. Universal JavaScript
  3. JavaScript Payload Reductions

Server and Client Rendering

The netflix.com legacy website stack had a hard separation between server markup and client enhancement. This was primarily due to the different programming languages used in each part of our application. On the server, there was Java with Tomcat, Struts and Tiles. On the browser client, we enhanced server-generated markup with JavaScript, primarily via jQuery.

This separation led to undesirable results in our startup time. Every time a visitor came to any page on netflix.com our Java tier would generate the majority of the response needed for the entire page's lifetime and deliver it as HTML markup. Often, users would be waiting for the generation of markup for large parts of the page they would never visit.

Our new architecture renders only a small amount of the page's markup, bootstrapping the client view. We can easily change the amount of the total view the server generates, making it easy to see the positive or negative impact. The server requires less data to deliver a response and spends less time converting data into DOM elements. Once the client JavaScript has taken over, it can retrieve all additional data for the remainder of the current and future views of a session on demand. The large wins here were the reduction of processing time in the server, and the consolidation of the rendering into one language.

We find the flexibility afforded by server and client rendering allows us to make intelligent choices of what to request and render in the server and the client, leading to a faster startup and a smoother transition between views.

Universal JavaScript

In order to support identical rendering on the client and server, we needed to rethink our rendering pipeline. Our previous architecture's separation between the generation of markup on the server and the enhancement of it on the client had to be dropped.

Three large pain points shaped our new Node.js architecture:

  1. Context switching between languages was not ideal.
  2. Enhancement of markup required too much direct coupling between server-only code generating markup and the client-only code enhancing it.
  3. We’d rather generate all our markup using the same API.

There are many solutions to this problem that don't require Universal JavaScript, but we found this lesson was most appropriate: When there are two copies of the same thing, it's fairly easy for one to be slightly different than the other. Using Universal JavaScript means the rendering logic is simply passed down to the client.

Node.js and React.js are natural fits for this style of application. With Node.js and React.js, we can render from the server and subsequently render changes entirely on the client after the initial markup and React.js components have been transmitted to the browser. This flexibility allows for the application to render the exact same output independent of the location of the rendering. The hard separation is no longer present and it's far less likely for the server and client to be different than one another.

Without shared rendering logic we couldn't have realized the potential of rendering only what was necessary on startup and everything else as data became available.

Reduce JavaScript Payload Impact

Building rich interactive experiences on the web often translates into a large JavaScript payload for users. In our new architecture, we placed significant emphasis on pruning large dependencies we can knowingly replace with smaller modules and delivering JavaScript only applicable for the current visitor.

Many of the large dependencies we relied on in the legacy architecture didn't apply in the new one. We've replaced these dependencies in favor of newer, more efficient libraries. Replacing these libraries resulted in a much smaller JavaScript payload, meaning members need less JavaScript to start browsing. We know there is significant work remaining here, and we're actively working to trim our JavaScript payload down further.

Time To Interactive

In order to test and understand the impact of our choices, we monitor a metric we call time to interactive (tti).

Amount of time spent between first known startup of the application platform and when the UI is interactive regardless of view. Note that this does not require that the UI is done loading, but is the first point at which the customer can interact with the UI using an input device.

For applications running inside a web browser, this data is easily retrievable from the Navigation Timing API (where supported).

Work is Ongoing

We firmly believe high performance is not an optional engineering goal – it's a requirement for creating great user-experiences. We have made significant strides in startup performance, and are committed to challenging our industry’s best-practices in the pursuit of a better experience for our members.

Over the coming months we'll be investigating Service Workers, asm.js, WebAssembly, and other emerging web standards to see if we can leverage them for a more performant website experience. If you’re interested in helping create and shape the next generation of performant web user-experiences, apply here.

RAD - Outlier Detection on Big Data

Outlier detection can be a pain point for all data-driven companies, especially as data volumes grow. At Netflix we have multiple datasets growing by 10B+ records/day, so we need automated anomaly detection tools to ensure data quality and identify suspicious anomalies. Today we are open-sourcing our outlier detection function, called Robust Anomaly Detection (RAD), as part of our Surus project.

As we built RAD we identified four generic challenges that are ubiquitous in outlier detection on “big data.”

  • High cardinality dimensions: High cardinality data sets - especially those with large combinatorial permutations of column groupings - make human inspection impractical.
  • Minimizing False Positives: A successful anomaly detection tool must minimize false positives. In our experience there are many alerting platforms that “sound an alarm” that goes ultimately unresolved. The goal is to create alerting mechanisms that can be tuned to appropriately balance noise and information.
  • Seasonality: Hourly/Weekly/Bi-weekly/Monthly seasonal effects are common and can be mis-identified as outliers deserving attention if not handled properly. Seasonal variability needs to be ignored.
  • Data is not always normally distributed: This has been a particular challenge since Netflix has been growing over the last 24 months. Generally though, an outlier tool must be robust so that it works on data that is not normally distributed.

In addition to addressing the challenges above, we wanted a solution with a generic interface (supporting application development). We met these objectives with a novel algorithm encased in a wrapper for easy deployment in our ETL environment.

Algorithm


We initially tested techniques like moving averages with standard deviations and time series/regression models (ARIMAX) but found that these simpler methods were not robust enough in high cardinality data.


The algorithm we finally settled on uses Robust Principal Component Analysis (RPCA) to detect anomalies. PCA uses the Singular Value Decomposition (SVD) to find low rank representations of the data. The robust version of PCA (RPCA) identifies a low rank representation, random noise, and a set of outliers by repeatedly calculating the SVD and applying “thresholds” to the singular values and error for each iteration. For more information please refer to the original paper by Candes et al. (2009).
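For intuition, here is a minimal sketch of that decomposition via alternating soft-thresholding of singular values and of individual entries (an inexact augmented Lagrangian style iteration). It is illustrative only, not the Surus implementation, and the default weights simply follow common choices from the RPCA literature.

import numpy as np

def soft_threshold(x, tau):
    """Entry-wise shrinkage toward zero by tau (proximal operator of the L1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Split M into a low-rank part L, a sparse outlier part S, and residual noise M - L - S."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))        # weight suggested by Candes et al.
    mu = mu if mu is not None else (m * n) / (4.0 * np.abs(M).sum())  # common step-size heuristic
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)   # Lagrange multiplier
    for _ in range(max_iter):
        # Low-rank step: compute the SVD and soft-threshold the singular values.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * soft_threshold(sig, 1.0 / mu)) @ Vt
        # Sparse step: soft-threshold individual entries to isolate outliers.
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S

Entries of S with large magnitude are the candidate anomalies; L captures the low-rank (roughly seasonal) structure and the leftover residual is treated as noise.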

Below is an interactive visualization of the algorithm at work on a simple/random dataset and on public climate data.




Pig Wrapper


Since Apache Pig is the primary ETL language at Netflix, we wrapped this algorithm in a Pig function enabling engineers to easily use it with just a few additional lines of code. We’ve open-sourced both the Java function that implements the algorithm and the Pig wrapper. The details and a sample application (with code) can be found here.

Business Application


The following are two popular applications where we initially implemented this anomaly detection system at Netflix with great success.


Netflix processes millions of transactions every day across tens of thousands of banking institutions/infrastructures in both real-time and batch environments. We’ve used the above solution to detect anomalies in payment failures at the individual bank level. With this system, business managers were able to follow up with their counterparts in the payment industry, thereby reducing the impact on Netflix customers.


Our signup flow was another important point of application. Today Netflix customers sign up across the world on hundreds of different types of browsers or devices. Identifying anomalies across unique combinations of country, browser/device and language helps our engineers understand and react to customer sign up problems in a timely manner.

Conclusion


A robust algorithm is paramount to the success of any anomaly detection system and RPCA has worked very well for detecting anomalies. Along with the algorithm, our focus on simplifying the implementation with a Pig wrapper made the tool a great success. The applications listed above have helped the Netflix data teams understand and react to anomalies faster, which reduces the impact to Netflix customers and our overall business.



Introducing Surus and ScorePMML

Today we’re announcing a new Netflix-OSS project called Surus. Over the next year we plan to release a handful of our internal user-defined functions (UDFs) that have broad adoption across Netflix. The use cases for these functions are varied in nature (e.g. scoring predictive models, outlier detection, pattern matching, etc.) and together extend the analytical capabilities of big data.

The first function we’re releasing allows for efficient scoring of predictive models in Apache Pig using the Predictive Model Markup Language (PMML). PMML is an open standard that provides a concise XML representation of predictive models, hence the name of the new function: ScorePMML.

ScorePMML


At Netflix, we use predictive models everywhere. Although the applications for each model are different, the process by which each of these predictive models is built and deployed is consistent. The process usually looks like this:

  1. Someone proposes an idea and builds a model on “small” data
  2. We decide to “scale-up” the prototype to see how well the model generalizes to a larger dataset
  3. We may eventually put the model into “production”

At Netflix, we have different tools for each step above. When scoring data in our Hadoop environment, we noticed a proliferation of custom scoring approaches in steps two and three. These custom approaches added overhead as individual developers migrated models through the process. Our solution was to adopt PMML as a standard way to represent model output and to write ScorePMML as a UDF for scoring PMML files at scale.

ScorePMML aligns Netflix predictive modeling capabilities around the open-source PMML standard. By leveraging the open-source standard, we enable a flexible and consistent representation of predictive models for each of the steps mentioned above. By using the same PMML representation of the predictive model at each step in the modeling process, we save time/money by reducing both the risk and cost of custom code. PMML provides an effective foundation to iterate quickly for the modeling methods it supports. Our data scientists have started adopting ScorePMML where it allows them to iterate and deploy models more effectively than the legacy approach.

An Example


Now for the practical part. Let’s imagine that you’re building a model in R. You might do something like this….

# Required Dependencies
require(randomForest)
require(gbm)
require(pmml)
require(XML)
data(iris)

# Column Names must NOT contain periods
names(iris) <- gsub("\\.","_",tolower(names(iris)))

# Build Models
iris.rf  <- randomForest(Species ~ ., data=iris, ntree=5)
iris.gbm <- gbm(Species ~ ., data=iris, n.trees=5)

# Convert to pmml
# Output to File
saveXML(pmml(iris.rf), file="~/iris.rf.xml")
saveXML(pmml(iris.gbm, n.trees=5), file="~/iris.gbm.xml")

And, now let’s say that you want to score 100 billion rows…

REGISTER '~/scoring.jar';

DEFINE pmmlRF  com.netflix.pmml.ScorePMML('~/iris.rf.xml');
DEFINE pmmlGBM com.netflix.pmml.ScorePMML('~/iris.gbm.xml');

-- LOAD Data
iris = load '~/iris.csv' using PigStorage(',') as
      (sepal_length,sepal_width,petal_length,petal_width,species);

-- Score two models in one pass over the data
scored = foreach iris generate pmmlRF(*) as RF, pmmlGBM(*) as GBM;
dump scored;

That’s how easy it should be.

There are a couple of things you should think about though before trying to score 100 billion records in Pig.  

  • We throw a Pig FrontendException when the Pig/Hive data types and column names don’t match the data types and column names in PMML. This means that you don’t need to wait for the Hadoop MR job to start before getting the feedback that something is wrong.
  • The ScorePMML constructor accepts local or remote file locations. This means that you can reference an HDFS or S3 path, or you can reference a local path (see the example above).
  • We’ve made scoring multiple models in parallel trivial. Furthermore, models are only read into memory once, so there isn’t a penalty when processing multiple models at the same time.
  • When scoring big (and usually uncontrolled) datasets it’s important to handle errors gracefully. You don’t want to rescore 100 records because you fail on the 101st record. Rather than throwing an exception (and failing the job) we’ve added an indicator to the output tuple that can be used for alerting.
  • Although this is currently written to be run in Pig we may migrate in the future to different platforms.

Obviously, more can be done. We welcome ideas on how to make the code better.  Feel free to make a pull request!

Conclusion


We’re excited to introduce Surus and share with the world in the upcoming months various UDF’s we find helpful while analyzing data at Netflix. ScorePMML was a big win for Netflix as we sought to streamline our processing and to minimize the time to production for our models. We hope that with this function (and others soon to be released) that you’ll be able to spend more time making cool stuff and less time struggling with the mundane.

Known Issues/Limitations


  • ScorePMML is built on jPMML 1.0.19, which doesn’t fully support the 4.2 PMML specification (as defined by the Data Mining Group). At the time of this writing not all enumerated missing value strategies are supported. This caused problems when we wanted to implement GBMs in PMML, so we had to add extra nodes in each tree to properly handle missing values.
  • Hive 0.12.0 (and thus Pig) has strict naming conventions for columns/relations which are relaxed in PMML. Non alpha-numeric characters in column names are not supported in ScorePMML. Please see the Hive documentation for more details on column naming in the Hive metastore.

Additional Resources


  • The Data Mining Group PMML Spec: The 4.1.2 specification is currently supported. The 4.2 version of the PMML spec is not currently supported. The DMG page will give you a sense of which model types are supported and how they are described in PMML.
  • jPMML: A collection of GitHub projects that contain tools for using PMML, including an alternative Pig implementation, jpmml-pig, written by Villu Ruusmann.
  • RPMML: An R-package for creating PMML files from common predictive modeling objects.

Netflix's Viewing Data: How We Know Where You Are in House of Cards

Over the past 7 years, Netflix streaming has expanded from thousands of members watching occasionally to millions of members watching over two billion hours every month.  Each time a member starts to watch a movie or TV episode, a “view” is created in our data systems and a collection of events describing that view is gathered.  Given that viewing is what members spend most of their time doing on Netflix, having a robust and scalable architecture to manage and process this data is critical to the success of our business.  In this post we’ll describe what works and what breaks in an architecture that processes billions of viewing-related events per day.

Use Cases

By focusing on the minimum viable set of use cases, rather than building a generic all-encompassing solution, we have been able to build a simple architecture that scales.  Netflix’s viewing data architecture is designed for a variety of use cases, ranging from user experiences to data analytics.  The following are three key use cases, all of which affect the user experience:

What titles have I watched?

Our system needs to know each member’s entire viewing history for as long as they are subscribed.  This data feeds the recommendation algorithms so that a member can find a title for whatever mood they’re in.  It also feeds the “recent titles you’ve watched” row in the UI.  What gets watched provides key metrics for the business to measure member engagement and make informed product and content decisions.

Where did I leave off in a given title?

For each movie or TV episode that a member views, Netflix records how much was watched and where the viewer left off.   This enables members to continue watching any movie or TV show on the same or another device.

What else is being watched on my account right now?

Sharing an account with other family members usually means everyone gets to enjoy what they like when they’d like.  It also means a member may have to have that hard conversation about who has to stop watching if they’ve hit their account’s concurrent screens limit.  To support this use case, Netflix’s viewing data system gathers periodic signals throughout each view to determine whether a member is or isn’t still watching.

Current Architecture

Our current architecture evolved from an earlier monolithic database-backed application (see this QCon talk or slideshare for the detailed history).  When it was designed, the primary requirements were that it must serve the member-facing use cases with low latency and it should be able to handle a rapidly expanding set of data coming from millions of Netflix streaming devices.  Through incremental improvements over 3+ years, we’ve been able to scale this to handle low billions of events per day.

Current Architecture Diagram

The current architecture’s primary interface is the viewing service, which is segmented into a stateful and stateless tier.  The stateful tier has the latest data for all active views stored in memory.  Data is partitioned into N stateful nodes by a simple mod N of the member’s account id.  When stateful nodes come online they go through a slot selection process to determine which data partition will belong to them.  Cassandra is the primary data store for all persistent data.  Memcached is layered on top of Cassandra as a guaranteed low latency read path for materialized, but possibly stale, views of the data.


We started with a stateful architecture design that favored consistency over availability in the face of network partitions (for background, see the CAP theorem).  At that time, we thought that accurate data was better than stale or no data.  Also, we were pioneering running Cassandra and memcached in the cloud so starting with a stateful solution allowed us to mitigate risk of failure for those components.  The biggest downside of this approach was that failure of a single stateful node would prevent 1/nth of the member base from writing to or reading from their viewing history.


After experiencing outages due to this design, we reworked parts of the system to gracefully degrade and provide limited availability when failures happened.  The stateless tier was added later as a pass-through to external data stores. This improved system availability by providing stale data as a fallback mechanism when a stateful node was unreachable.

Breaking Points

Our stateful tier uses a simple sharding technique (account id mod N) that is subject to hot spots, as Netflix viewing usage is not evenly distributed across all current members.  Our Cassandra layer is not subject to these hot spots, as it uses consistent hashing with virtual nodes to partition the data.  Additionally, when we moved from a single AWS region to running in multiple AWS regions, we had to build a custom mechanism to communicate the state between stateful tiers in different regions.  This added significant, undesirable complexity to our overall system.
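To make the contrast concrete, here is a small, hypothetical sketch (not Netflix code) of the two partitioning schemes: a direct account id mod N mapping, versus a consistent-hash ring with virtual nodes in which each node owns many small key ranges and adding a node only remaps a fraction of accounts.

import bisect
import hashlib

def mod_n_node(account_id, n_nodes):
    # Direct mod-N mapping: heavy accounts and skewed id ranges pile onto one node,
    # and changing N remaps nearly every account.
    return account_id % n_nodes

class ConsistentHashRing:
    """Consistent hashing with virtual nodes, in the style the text attributes to Cassandra."""

    def __init__(self, nodes, vnodes=128):
        self.ring = sorted(
            (self._hash("%s:%d" % (node, v)), node)
            for node in nodes
            for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, account_id):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self.keys, self._hash(str(account_id))) % len(self.ring)
        return self.ring[idx][1]

# ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
# ring.node_for(123456789)   # accounts spread across many small virtual-node ranges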


We created the viewing service to encapsulate the domain of collecting, processing, and providing viewing data.  As that system evolved to include more functionality and various read/write/update use cases, we identified multiple distinct components that had been combined into this single unified service.  These components would be easier to develop, test, debug, deploy, and operate if they were extracted into their own services.


Memcached offers superb throughput and latency characteristics, but isn’t well suited for our use case.  To update the data in memcached, we read the latest data, append a new view entry (if none exists for that movie) or modify an existing entry (moving it to the front of the time-ordered list), and then write the updated data back to memcached.  We use an eventually consistent approach to handling multiple writers, accepting that an inconsistent write may happen but will get corrected soon after due to a short cache entry TTL and a periodic cache refresh.  For the caching layer, using a technology that natively supports first class data types and operations like append would better meet our needs.
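As a rough illustration of that read-modify-write cycle, here is a simplified sketch using a hypothetical cache client and made-up field names; it is not the production code path.

import time

CACHE_TTL_SECONDS = 60  # a short TTL lets an occasional inconsistent write self-correct

def record_view(cache, account_id, view_entry):
    """Read the cached viewing list, move or append this title, and write the whole list back."""
    key = "viewing_history:%s" % account_id
    history = cache.get(key) or []  # latest materialized view (may be stale)
    # Drop any existing entry for this title, then put the new entry at the front
    # of the time-ordered list.
    history = [v for v in history if v["title_id"] != view_entry["title_id"]]
    history.insert(0, dict(view_entry, updated_at=time.time()))
    cache.set(key, history, expire=CACHE_TTL_SECONDS)  # hypothetical client API
    return history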


We created the stateful tier because we wanted the benefit of memory speed for our highest volume read/write use cases.  Cassandra was in its pre-1.0 versions and wasn’t running on SSDs in AWS.  We thought we could design a simple but robust distributed stateful system exactly suited to our needs, but ended up with a complex solution that was less robust than mature open source technologies.  Rather than solve the hard distributed systems problems ourselves, we’d rather build on top of proven solutions like Cassandra, allowing us to focus our attention on solving the problems in our viewing data domain.


Next Generation Architecture

In order to scale to the next order of magnitude, we’re rethinking the fundamentals of our architecture.  The principles guiding this redesign are:
  • Availability over consistency - our primary use cases can tolerate eventually consistent data, so design from the start favoring availability rather than strong consistency in the face of failures.
  • Microservices - Components that were combined together in the stateful architecture should be separated out into services (components as services).
    • Components are defined according to their primary purpose - either collection, processing, or data providing.
    • Delegate responsibility for state management to the persistence tiers, keeping the application tiers stateless.
    • Decouple communication between components by using signals sent through an event queue.
  • Polyglot persistence - Use multiple persistence technologies to leverage the strengths of each solution.
    • Achieve flexibility + performance at the cost of increased complexity.
    • Use Cassandra for very high volume, low latency writes.  A tailored data model and tuned configuration enables low latency for medium volume reads.
    • Use Redis for very high volume, low latency reads.  Redis’ first-class data types should handle these writes better than the read-modify-write cycle we used with memcached (see the sketch after this list).
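Below is a sketch of how a native list append can replace that read-modify-write cycle, assuming a redis-py style client; the key naming and the trim length are made up for illustration, and real viewing-history updates involve more than a simple push.

import redis  # assumes the redis-py client

r = redis.StrictRedis(host="localhost", port=6379)

def record_view(account_id, title_id):
    key = "viewing_history:%s" % account_id
    # A single server-side push replaces read -> modify -> write; Redis appends to the
    # list natively, so concurrent writers no longer clobber each other's full-list writes.
    r.lpush(key, title_id)
    r.ltrim(key, 0, 999)  # keep the list bounded (illustrative cap)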


Our next generation architecture will be made up of these building blocks:



Re-architecting a critical system to scale to the next order of magnitude is a hard problem, requiring many months of development, testing, proving out at scale, and migrating off of the previous architecture.  Guided by these architectural principles, we’re confident that the next generation that we are building will give Netflix a strong foundation to meet the needs of our massive and growing scale, enabling us to delight our global audience.  We are in the early stages of this effort, so if you are interested in helping, we are actively hiring for this work.  In the meantime, we’ll follow up this post with a future one focused on the new architecture.



Netflix Likes React


We are making big changes in the way we build the Netflix experience with Facebook’s React library. Today, we will share our thoughts on what makes React so compelling and how it is evolving our approach to UI development.

At the beginning of last year, Netflix UI engineers embarked on several ambitious projects to dramatically transform the user experience on our desktop and mobile platforms. Given a UI redesign of a scale similar to that undergone by TVs and game consoles, it was essential for us to re-evaluate our existing UI technology stack and to determine whether to explore new solutions. Do we have the right building blocks to create best-in-class single-page web applications? And what specific problems are we looking to solve?
Much of our existing front-end infrastructure consists of hand-rolled components optimized for the current website and iOS application. Our decision to adopt React was influenced by a number of factors, most notably: 1) startup speed, 2) runtime performance, and 3) modularity.

Startup Speed

We want to reduce the initial load time needed to provide Netflix members with a much more seamless, dynamic way to browse and watch individualized content. However, we find that the cost to deliver and render the UI past login can be significant, especially on our mobile platforms where there is a lot more variability in network conditions.

In addition to the time required to bootstrap our single-page application (i.e. download and process initial markup, scripts, stylesheets), we need to fetch data, including movie and show recommendations, to create a personalized experience. While network latency tends to be our biggest bottleneck, another major factor affecting startup performance is in the creation of DOM elements based on the parsed JSON payload containing the recommendations. Is there a way to minimize the network requests and processing time needed to render the home screen? We are looking for a hybrid solution that will allow us to deliver above-the-fold static markup on first load via server-side rendering, thereby reducing the tax incurred in the aforementioned startup operations, and at the same time enable dynamic elements in the UI through client-side scripting.

Runtime Performance

To build our most visually-rich cinematic Netflix experience to date for the website and iOS platforms, efficient UI rendering is critical. While there are fewer hardware constraints on desktops (compared to TVs and set-top boxes), expensive operations can still compromise UI responsiveness. In particular, DOM manipulations that result in reflows and repaints are especially detrimental to user experience.

Modularity

Our front-end infrastructure must support the numerous A/B tests we run in terms of the ability to rapidly build out new features and designs that code-wise must co-exist with the control experience (against which the new experiences are tested). For example, we can have an A/B test that compares 9 different design variations in the UI, which could mean maintaining code for 10 views for the duration of the test. Upon completion of the test, it should be easy for us to productize the experience that performed the best for our members and clean up code for the 9 other views that did not.

Advantages of React

React stood out in that its defining features not only satisfied the criteria set forth above, but offered other advantages, including being relatively easy to grasp and the ability to opt out, for example to handle custom user interactions and rendering code. We were able to leverage the following features to improve our application’s initial load times, runtime performance, and overall scalability: 1) isomorphic JavaScript, 2) virtual DOM rendering, and 3) support for compositional design patterns.

Isomorphic JavaScript

React enabled us to build JavaScript UI code that can be executed in both server (e.g. Node.js) and client contexts. To improve our startup times, we built a hybrid application where the initial markup is rendered server-side and the resulting UI elements are subsequently manipulated as done in a single-page application. It was possible to achieve this with React as it can render without a live DOM, e.g. via React.renderToString or React.renderToStaticMarkup. Furthermore, the UI code written using the React library that is responsible for generating the markup could be shared with the client to handle cases where re-rendering was necessary.

Virtual DOM

To reduce the penalties incurred by live DOM manipulation, React applies updates to a virtual DOM in pure JavaScript and then determines the minimal set of DOM operations necessary via a diff algorithm. The diffing of virtual DOM trees is fast relative to actual DOM modifications, especially using today’s increasingly efficient JavaScript engines such as WebKit’s Nitro with JIT compilation. Furthermore, we can eliminate the need for traditional data binding, which has its own performance implications and scalability challenges.

React Components and Mixins

React provides powerful Component and Mixin APIs that we relied on heavily to create reusable views, share common functionality, and establish patterns that facilitate feature extension. When A/B testing different designs, we can implement the views as separate React subcomponents that get rendered by a parent component depending on the user’s allocation in the test. Similarly, differences in behavioral logic can be abstracted into React mixins. Although it is possible to achieve modularity with a classical inheritance pattern, frequent changes in superclass interfaces to support new features affect existing subclasses and increase code fragility. React’s compositional pattern is ideal for overall maintenance and scalability of our front-end codebase as it isolates much of the A/B test code.

React has exceeded our requirements and enabled us to build a tremendous foundation on which to innovate the Netflix experience. Stay tuned in the coming months, as we will dive more deeply into how we are using React to transform traditional UI development!

By Jordanna Kwok


SPS : the Pulse of Netflix Streaming

We want to provide an amazing experience to each member, winning the “moments of truth” where they decide what entertainment to enjoy.  To do that, we need to understand the health of our system.  To quickly and easily understand the health of the system, we need a simple metric that a diverse set of people can comprehend.  In this post we will discuss how we discovered and aligned everyone around one operational metric indicating service health, enabling us to streamline production operations and improve availability.  We will detail how we approach signal analysis, deviation detection, and alerting for that signal and for other general use cases.

Creating the Right Signal

In the early days of Netflix streaming, circa 2008, we manually tracked hundreds of metrics, relying on humans to detect problems.  Our approach worked for tens of servers and thousands of devices, but not for the thousands of servers and millions of devices that were in our future.  Complexity and human-reliant approaches don’t scale; simplicity and algorithm-driven approaches do.

We sought out a single indicator that closely approximated our most important activity: viewing.  We discovered that a server-side metric related to playback starts (the act of “clicking play”) had both a predictable pattern and fluctuated significantly when UI/device/server problems were happening.  The Netflix streaming pulse was created.  We named it “SPS” for “starts per second”.

Example of stream starts per second, comparing week over week (red = current week, black = prior week)

The SPS pattern is regular within each geographic region, with variations due to external influences like major regional holidays and events.  The regional SPS pattern is one cycle per day and oscillates in a rising and falling pattern.  The peaks occur in the evening and the troughs occur in the early morning hours.  On regional holidays, when people are off work and kids are home from school, we see increased viewing in the daytime hours, as our members have more free time to enjoy viewing a title on Netflix.

Because there is consistency in the streaming behaviors of our members, with a moderate amount of data we can establish reliable predictions on how many stream starts we should expect at any point of the week.  Deviations from the prediction are a powerful way to tell the entire company when there is a problem with the service.  We use these deviations to trigger alerts and help us understand when service has been fully restored.

Its simplicity allows SPS to be central to our internal vernacular.  Production problems are categorized as “SPS impacting” or “not SPS impacting,” indicating their severity.  Overall service availability is measured using expected versus actual SPS levels.

Deviation Detection Models

To maximize the power of SPS, we need reliable ways to find deviations in our actual stream starts relative to our expected stream starts. The following are a range of techniques that we have explored in detecting such deviations.

Static Thresholds

A common starting place when trying to detect change is to define a fixed boundary that characterizes normal behavior. This boundary can be described as a floor or ceiling which, if crossed, indicates a deviation from normal behavior. The simplicity of static thresholds makes them a popular choice when trying to detect a presence, or increase, in a signal. For example, detecting when there is an increase in CPU usage:

Example where a static threshold could be used to help detect high CPU usage

However, static thresholds are insufficient in accurately capturing deviations in oscillating signals. For example, a static threshold would not be suitable for accurately detecting a drop in SPS due to its oscillating nature.

Exponential Smoothing

Another technique that can be used to detect deviations is to use an exponential smoothing function, such as exponential moving average, to compute an upper or lower threshold that bounds the original signal. These techniques assign exponentially decreasing weights as the observations get older. The benefit of this approach is that the bound is no longer static and can “move” with the input signal, as shown in the image below:

Example of data smoothing using moving average

Another benefit is that exponential smoothing techniques take all past observations into account while requiring only the most recent smoothed value to be kept in memory. These aspects make it desirable for real-time alerting.
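As a concrete sketch, here is a minimal exponentially weighted moving average used as a moving lower bound; the smoothing factor alpha and the tolerance band are illustrative placeholders, not values we use in production.

def ema_lower_bound(observations, alpha=0.3, band=0.2):
    """Yield (value, lower_bound, breached) per observation; the bound tracks the signal."""
    ema = None
    for x in observations:
        # Exponentially decreasing weights on older observations; only `ema` is stored.
        ema = x if ema is None else alpha * x + (1 - alpha) * ema
        lower = ema * (1 - band)   # the bound "moves" with the smoothed signal
        yield x, lower, x < lower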

Double Exponential Smoothing

To detect a change in SPS behavior, we use Double Exponential Smoothing (DES) to define an upper and lower boundary that captures the range of acceptable behavior. This technique includes a parameter that takes into account any trend in the data, which works well for following the oscillating trend in SPS. There are more advanced smoothing techniques, such as triple exponential smoothing, which also take into account seasonal trends. However, we do not use these techniques as we are interested in detecting a deviation in behavior over a short period of time which does not contain a pronounced seasonal trend.

Before creating a DES model one must first select values for the data smoothing factor and the trend smoothing factor. To visualize the effect these parameters have on DES, see this interactive visualization. The estimation of these parameters is crucial as they can greatly affect accuracy. While these parameters are typically determined by an individual's intuition or trial and error, we have experimented with data-driven approaches to automatically initialize them (motivated by Gardner [1]). We are able to apply those identified parameters to signals that share similar daily patterns and trends, for example SPS at the device level.
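For intuition, here is a compact sketch of double exponential smoothing (Holt's method) producing a lower alert bound. The data smoothing factor alpha, the trend smoothing factor beta, and the band width are illustrative placeholders rather than the data-driven values described above.

def des_lower_bound(observations, alpha=0.5, beta=0.1, band=0.15):
    """Holt's double exponential smoothing producing a moving lower alert bound."""
    level, trend = float(observations[0]), 0.0
    results = []
    for x in observations[1:]:
        forecast = level + trend                    # one-step-ahead prediction
        lower = forecast * (1 - band)
        results.append((x, lower, x < lower))       # a breach is a candidate SPS alert
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)          # data smoothing factor
        trend = beta * (level - prev_level) + (1 - beta) * trend   # trend smoothing factor
    return results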

The image below shows an example where DES has been used to create a lower bound in an attempt to capture a deviation in normal behavior. Shortly after 18:00 there is a drop in SPS which crosses the DES threshold, alerting us to a potential issue with our service. By alerting on this drop, we can respond and take actions to restore service health.


While DES accurately identifies a drop in SPS, it is unable to predict when the system has recovered. In the example below, the sharp recovery of SPS at approximately 20:00 is not accurately modeled by DES causing it to underpredict and generate false alarms for a short period of time:


In spite of these shortcomings, DES has been an effective mechanism for detecting actionable deviations in SPS and other operational signals.

Advanced Techniques

We have begun experimenting with Bayesian techniques in a stream mining setting to improve our ability to detect deviations in SPS. Examples include Bayesian switchpoint detection and Markov Chain Monte Carlo (MCMC) [2]. See [3] for a concise introduction to using MCMC for anomaly detection and [4] for Bayesian switchpoint detection.

Bayesian techniques offer some advantages over DES in this setting. Those familiar with probabilistic programming techniques know that the underlying models can be fairly complex, but they can be made to be non-parametric by drawing parameters from uniform priors when possible. Using the posteriors from such calculations as priors for the next iteration allows us to create models that evolve as they receive more data.
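As a toy illustration of what switchpoint detection computes, here is a brute-force posterior over a single switchpoint location in a noisy signal; it uses a plug-in (profile likelihood) shortcut with a flat prior rather than the MCMC machinery referenced above.

import numpy as np

def switchpoint_posterior(x, sigma=1.0):
    """Posterior over a single switchpoint index, assuming Gaussian noise, a flat prior,
    and segment means fixed at their maximum-likelihood values (a plug-in shortcut)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    log_post = np.full(n, -np.inf)
    for tau in range(1, n - 1):
        left, right = x[:tau], x[tau:]
        log_post[tau] = -(((left - left.mean()) ** 2).sum()
                          + ((right - right.mean()) ** 2).sum()) / (2.0 * sigma ** 2)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()   # the highest mass marks the most likely change in level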

Unfortunately, our experiments with Bayesian anomaly detection have revealed downsides compared to DES. MCMC is significantly more computationally intensive than DES, so much so that some are exploiting graphics cards in order to reduce the run time [5], a common technique for computationally intensive processes [6]. Furthermore, the underlying model is not as easily interpreted by a human due to the complexity of the parameter interactions. These limitations, especially the performance-related ones, restrict our ability to apply these techniques to a broad set of metrics in real time.

Bayesian techniques, however, do not solve the entire problem of data stream mining in a non-stationary setting. There exists a rich field of research on the subject of real-time data stream mining [7]. MCMC is by design a batch process, though it can be applied in a mini-batch fashion. Our current research incorporates learnings from the stream-mining community in stream classification and drift detection. Additionally, our Data Science and Engineering team has been working on an approach based on Robust Principal Component Analysis (RPCA) to deal with high cardinality data. We’re excited to see what comes from this research in 2015.

Conclusion

We have streamlined production operations and improved availability by creating a single directional metric that indicates service health: SPS. We have experimented with and used a number of techniques to derive additional insight from this metric, including threshold-based alerting, exponential and double exponential smoothing, and Bayesian and stream mining approaches. SPS is the pulse of Netflix streaming, focusing the minds at Netflix on ensuring streaming is working when you want it to be.

If you would like to join us in tackling these kinds of challenges, we are hiring!


References


Nicobar: Dynamic Scripting Library for Java

By James Kojo, Vasanth Asokan, George Campbell, Aaron Tull


The Netflix API is the front door to the streaming service, handling billions of requests per day from more than 1000 different device types around the world. To provide the best experience to our subscribers, it is critical that our UI teams have the ability to innovate at a rapid pace. As described in our blog post a year ago, we developed a Dynamic Scripting Platform that enables this rapid innovation.

Today, we are happy to announce Nicobar, the open source script execution library that allows our UI teams to inject UI-specific adapter code dynamically into our JVM without the API team’s involvement. Named after a remote archipelago in the eastern Indian Ocean, Nicobar allows each UI team to have its own island of code to optimize the client/server interaction for each device, evolved at its own pace.

Background

As of this post’s writing, a single Netflix API instance hosts hundreds of UI scripts, developed by a dozen teams. Together, they deploy anywhere between a handful to a hundred UI scripts per day. A strong, core scripting library is what allows the API JVM to handle this rate of deployment reliably and efficiently.

Our success with the scripting approach in the API platform led us to identify other applications that could also benefit from the ability to alter their behavior without a full scale deployment. Nicobar is a library that provides this functionality in a compact and reusable manner, with pluggable support for JVM languages.

Architecture Overview

Early implementations of dynamic scripting at Netflix used basic Java classloader technology to host scripts and sandbox them from one another. While this was a good start, it was not nearly enough. Standard Java classloaders can have only one parent, and thus allow only simple, flattened hierarchies; if one wants to share classloaders, this is a big limitation and an inefficient use of memory. Also, code loaded within standard classloaders is fully visible to downstream classloaders. Finer-grained visibility controls are helpful in restricting which packages are exported from and imported into classloaders.

Given these experiences, we designed into Nicobar a script module loader that holds a graph of inter-dependent script modules. Under the hood, we use JBoss Modules (which is open source) to create Java modules. JBoss modules represent powerful extensions to basic Java classloaders, allowing for arbitrarily complex classloader dependency graphs, including multiple parents. They also support sophisticated package filters that can be applied to incoming and outgoing dependency edges.

A script module provides an interface to retrieve the list of java classes held inside it. These classes can be instantiated and methods exercised on the instances, thereby “executing” the script module.

Script source and resource bundles are represented by script archives. Metadata for the archives is defined in the form of a script module specification, where script authors can describe the content language, inter-module dependencies, import and export filters for packages, as well as user specific metadata.

Script archive contents can be in source form and/or in precompiled form (.class files). At runtime, script archives are converted into script modules by running the archive through compilers and loaders that translate any source found into classes, and then loading up all classes into a module. Script compilers and loaders are pluggable, and out of the box, Nicobar comes with compilation support for Groovy 2, as well as a simple loader for compiled java classes.

Archives can be stored into and queried from archive repositories on demand, or via a continuous repository poller. Out of the box, Nicobar comes with a choice of file-system based or Cassandra based archive repositories. 

As the usage of a scripting system grows in scale, there is often the need for an administrative interface that supports publishing and modifying script archives, as well as viewing published archives. Towards this end, Nicobar comes with a manager and explorer subproject, based on Karyon and Pytheas.

Putting it all together

The diagram below illustrates how all the pieces work together.


Usage Example - Hello Nicobar!

Here is an example of initializing the Nicobar script module loader to support Groovy scripts.

Create a script archive

Create a simple Groovy script archive, with the following Groovy file:

Add a module specification file moduleSpec.json, along with the source:

Jar the source and module specification together as a jar file. This is your script archive.


Create a script module loader

Create an archive repository

If you have more than a handful of scripts, you will likely need a repository representing the collection. Let’s create a JarArchiveRepository, which is a repository of script archive jars at some file system path. Copy helloworld.jar into /tmp/archiveRepo to match the code below.

Hooking up the repository poller provides dynamic updates of discovered modules into the script module loader. You can wire up multiple repositories to a poller, which would poll them iteratively.

Execute script

Script modules can be retrieved out of the module loader by name (and an optional version). Classes can be retrieved from script modules by name, or by type. Nicobar itself is agnostic to the type of the classes held in the module, and leaves it to the application’s business logic to decide what to extract out and how to execute classes.

Here is an example of extracting a class implementing Callable and executing it:

At this point, any changes to the script archive jar will result in an update of the script module inside the module loader and new classes reflecting the update will be vended seamlessly!

More about the Module Loader

In addition to the ability to dynamically inject code, Nicobar’s module loading system also allows for multiple variants of a script module to coexist, providing for runtime selection of a variant. As an example, tracing code execution involves adding instrumentation code, which adds overhead. Using Nicobar, the application could vend classes from an instrumented version of the module when tracing is needed, while vending classes from the uninstrumented, faster version of the module otherwise. This paves the way for on demand tracing of code without having to add constant overhead on all executions. 

Module variants can also be leveraged to perform slow rollouts of script modules. When a module deployment is desired, a portion of the control flow can be directed through the new version of the module at runtime. Once confidence is gained in the new version, the update can be “completed”, by flushing out the old version and sending all control flow through the new module.
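As a sketch of the traffic-splitting idea, and explicitly not Nicobar's API, deterministic bucketing of accounts could direct a small percentage of control flow through the new module version:

import hashlib

def pick_module_variant(account_id, new_variant_pct=5):
    """Deterministically bucket accounts so a fixed percentage of traffic exercises
    the newly deployed module variant (illustrative routing logic only)."""
    bucket = int(hashlib.md5(str(account_id).encode("utf-8")).hexdigest(), 16) % 100
    return "new-module-version" if bucket < new_variant_pct else "current-module-version"

Raising new_variant_pct to 100, and then flushing the old version, corresponds to "completing" the rollout described above.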

Static parts of an application may benefit from a modular classloading architecture as well. Large applications loaded into a monolithic classloader can become unwieldy over time, due to an accumulation of unintended dependencies and tight coupling between various parts of the application. In contrast, loading components using Nicobar modules allows for well defined boundaries and fine-grained isolation between them. This, in turn, facilitates decoupling of components, thereby allowing them to evolve independently.

Conclusion

We are excited by the possibilities around creating dynamic applications using Nicobar. As usage of the library grows, we expect to see various feature requests around access controls, additional persistence and query layers, and support for other JVM languages.

Project Jigsaw, the JDK’s native module loading system, is on the horizon too, and we are interested in seeing how Nicobar can leverage native module support from Jigsaw.

If these kinds of opportunities and challenges interest you, we are hiring and would love to hear from you!