
Chelsea: Encoding in the Fast Lane

Back in May Netflix launched its first global talk show: Chelsea. Delivering this new format was a first for us and a fun challenge on many fronts, which this post describes in more detail. Chelsea Handler's new Netflix talk show ushered in a Day-of-Broadcast (DOB) style of delivery that is demanding on multiple levels for our teams, with a lightning-fast turnaround time. We looked at all the activities that take place in the Netflix Digital Supply Chain, from source delivery to live-on-site, gave each activity a time budget, and pushed all the teams to squeeze their times toward an aggressive overall goal. In this article we explain the enhancements and techniques the encoding team used to process this show faster than ever.

Historically there was not as much pressure on encode times. Our system was optimized for throughput and robustness, paying less attention to speed. In the last few years we had worked to reduce the ingest and encode time to about 2.5 hours. This met the demands of our most stringent use cases like the Day-After-Broadcast delivery of Breaking Bad. Now, Chelsea was pushing us to reduce this time even further. The new aggressive time budget calls for us to ingest and encode a 30 minute title in under 30 minutes. Our solution ends up using about 5 minutes for source inspection and 25 minutes for encoding.
The Starting Point
Although Chelsea challenged us to encode with a significantly shorter turnaround time compared to other movies or shows in our catalog, our work over the last few years on developing a robust and scalable cloud-based system helped jumpstart our efforts to meet this challenge.
Parallel Encoding
In the early days of Netflix streaming, the entire video encode of a title would be generated on a single Windows machine. For some streams (for example, slower codecs or higher resolutions), generating a single encode would take more than 24 hours. We improved on our system a few years ago by rolling out a parallel encoding workflow, which breaks up a title into "chunks" that can be processed in parallel on different machines. This allows for shorter latency, especially as the number of machines scales up, and for robustness to transient errors: if a machine is unexpectedly terminated, only a small amount of work is lost.
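To make the chunked workflow concrete, here is a minimal Python sketch of the idea, assuming a local ffmpeg install as a stand-in encoder; the real pipeline distributes chunks across many cloud instances and stitches the results, and none of these function names come from our actual system.

```python
# Minimal sketch of chunked parallel encoding (illustrative only; the real
# pipeline fans chunks out to many cloud instances, not local processes).
from concurrent.futures import ProcessPoolExecutor
import subprocess

CHUNK_SECONDS = 180  # the standard 3-minute chunk discussed below

def encode_chunk(source_path, start, duration, out_path):
    """Encode one chunk of the source; a failed chunk can be retried
    without redoing the whole title."""
    cmd = [
        "ffmpeg", "-ss", str(start), "-t", str(duration),
        "-i", source_path, "-c:v", "libx264", out_path,
    ]
    subprocess.run(cmd, check=True)
    return out_path

def encode_title(source_path, total_seconds):
    starts = range(0, total_seconds, CHUNK_SECONDS)
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(encode_chunk, source_path, s,
                        min(CHUNK_SECONDS, total_seconds - s),
                        f"chunk_{i:04d}.mp4")
            for i, s in enumerate(starts)
        ]
        # Chunks finish independently; they are stitched together afterwards.
        return [f.result() for f in futures]
```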
Automated Parallel Inspections
To ensure that we deliver high quality video streams to our members, we have invested in developing automated quality checks throughout the encoding pipeline. We start with inspecting the source “mezzanine” file to make sure that a pristine source is ingested into the system. Types of inspections include detection of wrong metadata, picture corruption, insertion of extra content, frame rate conversion and interlacing artifacts. After generating a video stream, we verify the encodes by inspecting the metadata, comparing the output video to the mezzanine fingerprint and generating quality metrics. This enables us to detect issues caused by glitches on the cloud instances or software implementation bugs. Through automated inspections of the encodes we can detect output issues early on, without the video having to reach manual QC. Just as we do encoding in parallel by breaking the source into chunks, likewise we can run our automated inspections in parallel by chunking the mezzanine file or encoded video.
Internal Spot Market
Since automated inspections and encoding are enabled to run in parallel in a chunked model, increasing the number of available instances can greatly reduce end-to-end latency. We recently worked on a system to dynamically leverage unused Netflix-reserved AWS servers during off-peak hours. The additional cloud instances, not used by other Netflix services, allowed us to expedite and prioritize encoding of Chelsea’s show.
Priority Scheduling
Encoding jobs can come in varying priorities, from highly urgent (e.g., DOB titles, or interactive jobs submitted by humans) to low-priority background backfill. Within the same title, certain codecs and bitrates rank higher in priority than others so that the bitrates required to go live are always processed first. To handle the fine-grained and dynamic nature of job priority, the encoding team developed a custom priority messaging service. Priorities are broadly grouped into priority classes modeled after the US Postal Service classes of mail, and fine-grained job priority is expressed by a due date. Chelsea belongs to the highest priority class, Express (sorry, no Sunday delivery). With the axiom that "what's important is needed yesterday", all Chelsea show jobs are due 30 years ago!
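As an illustration of how a priority class and a due date can combine into a single ordering, here is a small Python sketch; the class names echo the postal-service analogy above, while the in-memory heap is only a stand-in for the actual priority messaging service.

```python
# Sketch of the "priority class + due date" ordering described above.
# The queue itself is a stand-in, not the actual priority messaging service.
import heapq
from datetime import datetime, timedelta

CLASS_RANK = {"express": 0, "priority": 1, "standard": 2, "bulk": 3}

def sort_key(job):
    # Jobs are ordered first by class, then by due date within a class.
    return (CLASS_RANK[job["priority_class"]], job["due_date"])

queue = []

def submit(job):
    heapq.heappush(queue, (sort_key(job), job["name"]))

# A DOB title: Express class, due "30 years ago" so it sorts ahead of
# everything else in its class.
submit({"name": "chelsea-e42-h264-1080p",
        "priority_class": "express",
        "due_date": datetime.utcnow() - timedelta(days=365 * 30)})
submit({"name": "catalog-backfill-vp9",
        "priority_class": "bulk",
        "due_date": datetime.utcnow() + timedelta(days=7)})

print(heapq.heappop(queue)[1])  # -> chelsea-e42-h264-1080p
```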
Innovations Motivated by Chelsea
As we analyzed our entire process looking for ways to make it faster, it was apparent that DOB titles have different goals and characteristics than other titles. Some improvement techniques are only practical for a DOB title; others might make sense for ordinary titles but are only practical at the smaller scale of DOB deliveries, not across the entire catalog. Low latency is often at odds with high throughput, and we still have to support enormous throughput. So understand that the techniques described here are used selectively on the most urgent of titles.

When trying to make anything faster we consider these standard approaches:
  1. Use phased processing to postpone blocking operations
  2. Increase parallelism
  3. Make it plain faster
We will mention some improvements from each of these categories.
Phased Processing
Inspections
Most sources for Netflix originals go through a rigorous set of inspections after delivery, both manual and automated. First, manual inspections happen on the source delivered to us to check whether it adheres to the Netflix source guidelines. With Chelsea, this inspection begins early, with the pre-taped segments being inspected well before the show itself is taped. Then, inspections are done during taping and again during the editorial process, right on set. By the time it is delivered, we are confident that it needs no further manual QC because exhaustive QC was performed in post-production.
We have control over the source production; it is our studio, our crew, and our editing process. It is well-rehearsed and well-known. If we assume the source is good, we can bypass the automated inspections that focus on errors introduced by the production process. Examples of inspections typically done on all sources are detection of telecine, interlacing, audio hits, silence in audio, and bad channel mapping. Bypassing the most expensive inspections, such as deep audio inspections, allowed us to bring the execution time down from 30 minutes to about 5 minutes on average. Aside from detecting problems, the inspection stage generates artifacts that are necessary for the encoding process. We retain all inspections that produce these artifacts.
Complexity Analysis
A previous article described how we use an encoding recipe uniquely tailored to each title. The first step in this per-title encode optimization is complexity analysis, an expensive examination of large numbers of frames to decide on a strategy that is optimal for the title.
For a DOB title, we are willing to release it with a standard set of recipes and bitrates, the same way Netflix had delivered titles for years. This standard treatment is designed to give a good experience for any show and does an adequate job for something like Chelsea.
We launch an asynchronous job to do the complexity analysis on Chelsea, which will trigger a re-encode and produce streams with optimal efficiency and quality. We are not blocked on this: if it is not finished by the show start date, the show still goes live with the standard streams, and sometime later the new streams replace the old.
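A minimal sketch of this phased approach, with hypothetical function names standing in for the real pipeline steps:

```python
# Sketch of the phased approach: go live immediately with the standard
# recipes, and let the expensive per-title complexity analysis trigger a
# re-encode whenever it finishes. All names here are illustrative stand-ins.
import threading

STANDARD_RECIPES = ["h264_1080p_5800k", "h264_720p_3000k", "h264_480p_1050k"]

def encode(title, recipe):
    return f"{title}/{recipe}"                 # placeholder for a real encode

def analyze_complexity(title):
    # Placeholder: in reality this examines many frames to pick per-title recipes.
    return ["per_title_1080p_4300k", "per_title_720p_2350k"]

def publish(title, streams):
    print(f"{title}: publishing {streams}")

def deliver_dob_title(title):
    # Phase 1: standard streams, on the critical path to go-live.
    publish(title, [encode(title, r) for r in STANDARD_RECIPES])

    # Phase 2: complexity analysis and re-encode run off the critical path;
    # the optimized streams replace the standard ones whenever they are ready.
    def reencode():
        publish(title, [encode(title, r) for r in analyze_complexity(title)])
    threading.Thread(target=reencode).start()

deliver_dob_title("chelsea-e42")
```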
Increase Parallelism
Encoding in Chunks
As mentioned earlier, breaking up a video into small chunks and encoding different chunks in parallel can effectively reduce the overall encoding time. At the time the DOB project started, we still had a few codecs that were processed as a single chunk, such as H.263. We took this opportunity to create a chunkable process for these remaining codecs.
Optimized Encoding Chunk Size
For DOB titles we went more extreme. After extensive testing with different chunk sizes, we discovered that by reducing the chunk size from our previous standard of 3 minutes to 30 seconds we can cut down the encoding time by 80% without noticeable video quality degradation.

More chunks means more overhead, so for normal titles we stick to a 3-minute chunk size. For DOB titles we are willing to pay the increased overhead.
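A back-of-the-envelope model of the trade-off, assuming enough encoders to run every chunk in parallel; the encode ratio and per-chunk overhead below are made-up numbers for illustration only:

```python
# Toy model of the chunk-size trade-off: with full parallelism the wall-clock
# time tracks the slowest chunk, while total overhead grows with chunk count.
def chunking_estimate(title_minutes, chunk_minutes,
                      encode_ratio=3.0, per_chunk_overhead_min=1.0):
    """encode_ratio: minutes of encoding per minute of video for one chunk."""
    n_chunks = -(-title_minutes // chunk_minutes)   # ceiling division
    per_chunk = chunk_minutes * encode_ratio + per_chunk_overhead_min
    return n_chunks, per_chunk

print(chunking_estimate(30, 3))    # 10 chunks, ~10 min wall clock per chunk
print(chunking_estimate(30, 0.5))  # 60 chunks, ~2.5 min wall clock per chunk
```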
Reduce Dependency in Steps
Some older codec formats (for example, VC-1), used by legacy TVs and Blu-ray players, were being encoded from lightly-compressed intermediate files. This meant we could not begin encoding these streams (one of the slower processes in our pipeline) until we had finished the intermediate encode. We changed our process to generate the legacy streams directly from the source so that we did not have to wait for the intermediate step.
Make It Faster
Infrastructure Enhancements
Once an AV source is entered into the encoding system, we encode it with a number of codecs, resolutions, and bitrates for all Netflix playback devices. To meet the SLA for DOB encoding and be as fast as possible, we need to run all DOB encoding jobs in parallel without waiting. An extra challenge is that the finer chunk size used for DOB means even more jobs must run in parallel.
Right Sizing
The majority of the computing resources are spent on video encoding. A relatively small percentage is spent on source inspection, audio, subtitles, and other assets. It is easy to pre-scale the production environment for these smaller activities. On the other hand, with a 30-second chunk size, we drastically increase the number of parallel video encoding activities. For a 30-minute Chelsea episode, we estimated a need for 1,000 video encoders to compute all codecs, resolutions, and bitrates at the same time. For the video encoders, we make use of the internal spot market, the unused Netflix reserved instances, to achieve this high instance count.
Warm Up
The resource scheduler normally samples the work queues and autoscales video encoders based on the workload at the moment. Scaling Amazon EC2 instances takes time; how long depends on many factors, and it could prevent us from meeting the SLA for encoding a DOB title. Pre-scaling 1,000 video encoders eliminates the scaling time penalty when a DOB title arrives, but it is uneconomical to keep 1,000 video encoders running 24x7 regardless of workload.
To strike a balance, we introduced a warm-up mechanism. We pre-scale 1,000 video encoders at the earliest signal of an imminent DOB title arrival and keep them around for an hour. The Netflix ingest pipeline sends a notification to the resource scheduler whenever we start to receive a DOB title from the set. Upon receiving the notification, the resource scheduler immediately procures 1,000 video encoders spread out over many instance types and zones (e.g. r3.2xlarge in us-east-1e), parallelizing instance acquisition to reduce the overall time. This strategy also mitigates the risk of running out of a specific instance type and availability zone combination.
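A rough sketch of what spreading a warm-up across instance types and zones could look like; launch_instances() is a hypothetical wrapper around the cloud API, not an actual Netflix or AWS call:

```python
# Sketch of spreading a 1,000-encoder warm-up across instance types and
# availability zones, launching each pool in parallel.
from concurrent.futures import ThreadPoolExecutor

INSTANCE_POOLS = [("r3.2xlarge", "us-east-1a"), ("r3.2xlarge", "us-east-1e"),
                  ("r3.4xlarge", "us-east-1c"), ("c3.4xlarge", "us-east-1d")]

def launch_instances(instance_type, zone, count):
    # Placeholder for the real procurement call; returns fake instance ids.
    return [f"{instance_type}/{zone}/{i}" for i in range(count)]

def warm_up(total=1000):
    # Spread the request evenly over (type, zone) pools and launch in parallel
    # to reduce acquisition time and the risk of exhausting any single pool.
    per_pool = total // len(INSTANCE_POOLS)
    with ThreadPoolExecutor(max_workers=len(INSTANCE_POOLS)) as pool:
        futures = [pool.submit(launch_instances, t, z, per_pool)
                   for t, z in INSTANCE_POOLS]
        return [iid for f in futures for iid in f.result()]

print(len(warm_up()))  # -> 1000
```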
Priority and Job Preemption
Since the warm-up comes in advance of having actual DOB jobs, the video encoders will busy themselves with existing encode jobs. By the time the DOB Express priority jobs arrive, it is possible that a video encoder already has a lower-priority job in flight. We can't afford to wait for these jobs to finish before beginning the Express priority jobs. To mitigate this scenario, we enhanced our custom-built priority messaging service with job preemption, where a high-priority job such as a Chelsea video encode interrupts a lower-priority job.
Empirical data shows that all DOB jobs are picked up within 60 seconds on average.
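Here is a small, self-contained sketch of cooperative preemption, where an in-flight job checks for a higher-priority arrival between segments and re-queues itself; everything here, including the queue, is illustrative rather than the actual messaging service:

```python
# Sketch of cooperative preemption on an encoder worker.
import queue, threading, time

EXPRESS, BULK = 0, 3
work = queue.PriorityQueue()

def encode(job_name, my_prio):
    for segment in range(50):
        # Preemption point between segments: peek at the queue head
        # (simplification; the real service pushes a preemption signal).
        if not work.empty() and work.queue[0][0] < my_prio:
            print(f"{job_name}: preempted at segment {segment}, re-queueing")
            return False
        time.sleep(0.01)                      # stand-in for real encoding work
    print(f"{job_name}: finished")
    return True

def worker():
    while True:
        try:
            prio, name = work.get(timeout=1)
        except queue.Empty:
            return
        if not encode(name, prio):
            work.put((prio, name))            # interrupted job goes back in line

work.put((BULK, "catalog-backfill"))
# An Express DOB job arrives while the bulk job is running.
threading.Thread(target=lambda: (time.sleep(0.1),
                                 work.put((EXPRESS, "chelsea-dob-chunk-007")))).start()
worker()
```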
The Fast Lane
We examined all the interactions with other systems and teams and identified latencies that could be improved. Like the encoding system, other systems at Netflix were also designed for high throughput, while latency was a secondary concern. For all systems overall, throughput remains a top priority and we cannot sacrifice in this area. In many cases it was not practical to improve the latency of interactions for all titles while satisfying throughput demands so we developed special fast lane communications. Communications about DOB titles follow a different path, one that leads to lower latency.
Conclusion
We achieved our goal of reducing the time for ingest and encode to be approximately the runtime of the source, i.e. 30 minutes for a 30 minute source. We were pleased that the architecture we put in place in recent years that emphasizes flexibility and configurability provided a great foundation for building out the DOB process. This investment paid off by allowing us to quickly respond to business demands and effectively deliver our first talk show to members all around the world.

By Rick Wong, Zhan Chen, Anne Aaron, Megha Manohara, and Darrell Denlinger

Netflix Billing Migration to AWS - Part II


This is a continuation in the series on Netflix Billing migration to the Cloud. An overview of the migration project was published earlier here. This post details the technical journey for the Billing applications and datastores as they were moved from the Data Center to AWS Cloud.

As you might have read in earlier Netflix Cloud Migration blogs, all of Netflix's streaming infrastructure now runs completely in the Cloud. At the rate Netflix was growing, especially with the imminent Netflix Everywhere launch, we knew we had to move Billing to the Cloud sooner rather than later, or our existing legacy systems would not be able to scale.

There was no doubt that it would be a monumental task: moving highly sensitive applications and critical databases without disrupting the business, while continuing to build new business functionality and features.

A few key responsibilities and challenges for Billing:

  • The Billing team is responsible for the financially critical data in the company. The data we generate on a daily basis for subscription charges, gift cards, credits, chargebacks, etc. is rolled up to Finance and reported in Netflix accounting. We have stringent SLAs on our daily processing to ensure that revenue gets booked correctly for each day. We cannot tolerate delays in processing pipelines.
  • Billing has zero tolerance for data loss.
  • For the most part, the existing data was structured with a relational model and necessitated the use of transactions to ensure all-or-nothing behavior. In other words, we needed to be ACID for some operations. But we also had use cases where we needed to be highly available across regions with minimal replication latencies.
  • Billing integrates with the DVD business of the company, which has a different architecture than the Streaming component, adding to the integration complexity.
  • The Billing team also provides data to support Netflix Customer Service agents to answer any member billing issues or questions. This necessitates providing Customer Support with a comprehensive view of the data.

The state of the Billing systems when we started this project is shown below.
[Figure: Billing system architecture before the migration]
  • 2 Oracle databases in the Data Center - one storing the customer subscription information and the other storing the invoice/payment data.
  • Multiple REST-based applications - serving calls from www.netflix.com and the Customer Support applications. These essentially performed the CRUD operations.
  • 3 Batch applications -
      • Subscription Renewal - A daily job that looks through the customer base to determine the customers to be billed that day and the amount to be billed by looking at their subscription plans, discounts, etc.
      • Order & Payment Processor - A series of batch jobs that create an invoice to charge the customer to be renewed and process the invoice through various stages of the invoice lifecycle.
      • Revenue Reporting - A daily job that looks through billing data and generates reports for the Netflix Finance team to consume.
  • One Billing Proxy application (in the Cloud) - used to route calls from the rest of the Netflix applications in the Cloud to the Data Center.
  • Weblogic queues with legacy formats being used for communications between processes.

The goal was to move all of this to the Cloud and not have any billing applications or databases in the Data Center, all without disrupting business operations. We had a long way to go!
The Plan

We came up with a 3-step plan to do it:
  • Act I - Launch new countries directly in the Cloud on the billing side while syncing the data back to the Data Center for legacy batch applications to continue to work.
  • Act II - Model the user-facing data, which could live with eventual consistency and does not need to be ACID, to persist to Cassandra (Cassandra gave us the ability to perform writes in one region and make it available in the other regions with very low latency. It also gives us high-availability across regions).
  • Act III - Finally move the SQL databases to the Cloud.
With each step and each country migration, we would learn from it, iterate, and improve.
Act I – Redirect new countries to the Cloud and sync data to the Data Center
Netflix was going to launch in 6 new countries soon. We decided to take it as a challenge to launch these countries partly in the Cloud on the billing side. What that meant was that the user-facing data and applications would be in the Cloud, but we would still need to sync data back to the Data Center so that some of our batch applications, which would continue to run in the Data Center for the time being, could work without disruption. Customer data for these new countries would be served out of the Cloud while the batch processing would still run out of the Data Center. That was the first step.
We ported all the APIs from the 2 user-facing applications to a Cloud based application that we wrote using Spring Boot and Spring Integration. With Spring Boot, we were able to quickly jump-start building a new application, as it provided the infrastructure and plumbing we needed to stand it up out of the box and let us focus on the business logic. With Spring Integration we were able to write once and reuse a lot of the workflow style code. Also with headers and header-based routing support that it provided, we were able to implement a pub-sub model within the application to put a message in a channel and have all consumers consume it with independent tuning for each consumer. We were now able to handle the API calls for members in the 6 new countries in any AWS region with the data stored in Cassandra. This enabled Billing to be up for these countries even if an entire AWS region went down – the first time we were able to see the power of being on the Cloud!

We deployed our application on EC2 instances in AWS in multiple regions. We added a redirection layer in our existing Cloud proxy application to switch billing calls for users in the new countries to go to the new billing APIs in the Cloud and billing calls for the users in the existing countries to continue to go to the old billing APIs in the Data Center. We opened direct connectivity from one of the AWS regions to the existing Oracle databases in the Data Center and wrote an application to sync the data from Cassandra via SQS in the 3 regions back to this region. We used SQS queues and Dead Letter Queues (DLQs) to move the data between regions and process failures.
New country launches usually mean a bump in the member base. We knew we had to move our Subscription Renewal application from the Data Center to the Cloud so that we would not put that load on the Data Center. So for these 6 new countries in the Cloud, we wrote a crawler that went through all the customers in Cassandra daily and came up with the members who were to be charged that day. This all-row iterator approach would work for now for these countries, but we knew it wouldn't hold ground when we migrated the other countries, and especially the US data (which had the majority of our members at that time), to the Cloud. But we went ahead with it to test the waters. This would be the only batch application we would run from the Cloud in this stage.
We had chosen Cassandra as our data store to be able to write from any region and due to the fast replication of the writes it provides across regions. We defined a data model where we used the customerId as the key for the row and created a set of composite Cassandra columns to enable the relational aspect of the data. The picture below depicts the relationship between these entities and how we represented them in a single column family in Cassandra. Designing them to be a part of a single column family helped us achieve transactional support for these related entities.


We designed our application logic such that we read once at the beginning of any operation, updated objects in memory, and persisted them to a single column family at the end of the operation. Reading from Cassandra or writing to it in the middle of the operation was deemed an anti-pattern. We wrote our own custom ORM using Astyanax (a Netflix-grown, open-sourced Cassandra client) to read/write the domain objects from/to Cassandra.
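A sketch of this read-once / write-once pattern, with a hypothetical account shape and an in-memory store standing in for the Astyanax-based ORM (which was Java, not Python):

```python
# Sketch of the read-once / mutate-in-memory / write-once pattern around a
# single column family. The store and BillingAccount shapes are hypothetical.
from datetime import date, timedelta

class BillingAccount:
    def __init__(self, customer_id, subscription, invoices):
        self.customer_id = customer_id
        self.subscription = subscription    # related entities live together ...
        self.invoices = invoices            # ... under one row key (customerId)

class InMemoryStore:                        # stand-in for the Cassandra-backed ORM
    def __init__(self): self.rows = {}
    def read(self, key): return self.rows[key]
    def write(self, key, value): self.rows[key] = value

def renew(store, customer_id, today):
    account = store.read(customer_id)       # 1. read the whole row once

    # 2. mutate only in memory: no datastore reads or writes mid-operation
    account.invoices.append({"date": today, "amount": account.subscription["price"]})
    account.subscription["next_billing_date"] = today + timedelta(days=30)

    store.write(customer_id, account)       # 3. persist back in a single write
    return account.invoices[-1]

store = InMemoryStore()
store.write(42, BillingAccount(42, {"plan": "standard", "price": 9.99,
                                    "next_billing_date": date(2016, 8, 1)}, []))
print(renew(store, 42, date(2016, 8, 1)))
```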

We launched in the new countries in the Cloud with this approach and after a couple of initial minor issues and bug fixes, we stabilized on it. So far so good!
The Billing system architecture at the end of Act I was as shown below:
[Figure: Billing system architecture at the end of Act I]
Act II – Move all applications and migrate existing countries to the cloud
With Act I done successfully, we started focusing on moving the rest of the apps to the Cloud without moving the databases. Most of the business logic resides in the batch applications, which had matured over the years, and that meant digging into the code for every condition and spending time to rewrite it. We could not simply forklift these to the Cloud as is. We used this opportunity to remove dead code where we could, break out functional parts into their own smaller applications, and restructure existing code to scale. These legacy applications were coded to read from config files on disk at startup and to use other static resources, like reading messages from Weblogic queues - all anti-patterns in the Cloud due to the ephemeral nature of the instances. So we had to re-implement those modules to make the applications Cloud-ready. We had to change some APIs to follow an async pattern to allow moving the messages through the queues to the region where we had now opened a secure connection to the Data Center.
The Cloud Database Engineering (CDE) team set up a multi-node Cassandra cluster for our data needs. We knew that the all-row Cassandra iterator Renewal solution we had implemented for renewing customers from the earlier 6 countries would not scale once we moved the entire Netflix member billing data to Cassandra. So we designed a system that uses Aegisthus to pull the data from Cassandra SSTables and convert it to JSON-formatted rows that are staged out to S3 buckets. We then wrote Pig scripts to run mapreduce on the massive dataset every day to fetch the list of customers to renew and charge that day. We also wrote Sqoop jobs to pull data from Cassandra and Oracle and write it to Hive in a queryable format, which enabled us to join these two datasets in Hive for faster troubleshooting.
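The selection logic itself is simple; here is a hedged Python sketch of what the daily renewal query over those staged JSON rows amounts to, with made-up field names (the production job was a Pig script over S3 data, not Python):

```python
# Sketch of the daily renewal selection performed over JSON rows staged to S3.
import json
from datetime import date

def customers_to_renew(json_lines, billing_day):
    """Yield customer ids whose next billing date falls on billing_day."""
    for line in json_lines:
        row = json.loads(line)
        if (row.get("status") == "active"
                and row.get("next_billing_date") == billing_day.isoformat()):
            yield row["customer_id"]

sample = [
    '{"customer_id": 1, "status": "active", "next_billing_date": "2016-08-01"}',
    '{"customer_id": 2, "status": "cancelled", "next_billing_date": "2016-08-01"}',
]
print(list(customers_to_renew(sample, date(2016, 8, 1))))  # -> [1]
```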

To enable DVD servers to talk to us in the Cloud, we set up load balancer endpoints (with SSL client certificates) for DVD to route calls to us through the Cloud proxy, which for now would pipe the calls back to the Data Center until we migrated the US. Once the US data migration was done, we would sever the Cloud to Data Center communication link.

To validate this huge data migration, we wrote a comparator tool to compare and validate the data that was migrated to the Cloud against the existing data in the Data Center. We ran the comparator iteratively, which let us identify any bugs in the migration, fix them, clear out the data, and re-run. As the runs became cleaner and devoid of issues, our confidence in the data migration grew. We were excited to start with the migration of the countries. We chose a country with a small Netflix member base as the first country and migrated it to the Cloud with the following steps:

  • Disable the non-GET APIs for the country under migration. (This would not impact members, but delay any updates to subscriptions in billing.)
  • Use Sqoop jobs to get the data from Oracle to S3 and Hive.
  • Transform it to the Cassandra format using Pig.
  • Insert the records for all members for that country into Cassandra.
  • Enable the non-GET APIs to now serve data from the Cloud for the country that was migrated.

After validating that everything looked good, we moved on to the next country. We then ramped up to migrate sets of similar countries together. The last country we migrated was the US, as it held most of our member base and also had the DVD subscriptions. With that, all of the customer-facing data for Netflix members was now being served through the Cloud. This was a big milestone for us!
After Act II, we were looking like this:
[Figure: Billing system architecture at the end of Act II]
Act III – Goodbye, Data Center!
Now the only (and most important) thing remaining in the Data Center was the Oracle database. The dataset that remained in Oracle was highly relational, and we did not feel it was a good idea to force it into a NoSQL-esque paradigm. It was not possible to structure this data as a single column family as we had done with the customer-facing subscription data. So we evaluated Oracle and Aurora RDS as possible options. Licensing costs for Oracle as a Cloud database, and Aurora still being in Beta, didn't help make the case for either of them.

While the Billing team was busy with the first two acts, our Cloud Database Engineering team was working on creating the infrastructure to migrate billing data to MySQL instances on EC2. By the time we started Act III, the database infrastructure pieces were ready, thanks to their help. We had to convert our batch application code base to be MySQL-compliant, since some of the applications used plain JDBC without any ORM. We also got rid of a lot of the legacy PL/SQL code and rewrote that logic in the application, stripping out dead code where possible.
Our database architecture now consists of a MySQL master database deployed on EC2 instances in one of the AWS regions. We have a Disaster Recovery DB that is replicated from the master and will be promoted to master if the master goes down. And we have slaves in the other AWS regions for read-only access by applications.
Our Billing Systems, now completely in the Cloud, look like this:
[Figure: Billing systems running completely in the Cloud]
Needless to say, we learned a lot from this huge project. We wrote a few tools along the way to help us debug and troubleshoot and to improve developer productivity. We got rid of old and dead code, cleaned up some of the functionality, and improved it wherever possible. We received support from many other engineering teams within Netflix. Engineers from Cloud Database Engineering, Subscriber and Account Engineering, Payments Engineering, and Messaging Engineering worked with us on this initiative for anywhere between 2 weeks and a couple of months. The great thing about the Netflix culture is that everyone has one goal in mind - to deliver a great experience for our members all over the world. If that means helping the Billing solution move to the Cloud, then everyone is ready to do that, irrespective of team boundaries!
The road ahead …
With Billing in the Cloud, the Netflix streaming infrastructure now runs completely in the Cloud. We can scale any Netflix service on demand, do predictive scaling based on usage patterns, do single-click deployments using Spinnaker, and have consistent deployment architectures between various Netflix applications. Billing infrastructure can now make use of all the Netflix platform libraries and frameworks for monitoring and tooling support in the Cloud. Today we support billing for over 81 million Netflix members in 190+ countries. We generate and churn through terabytes of data every day to accomplish billing events. Our road ahead includes re-architecting membership workflows for global scale and new business challenges. As part of our new architecture, we will be redefining our services to scale natively in the Cloud. With the global launch, we have an opportunity to learn and redefine Billing and Payment methods in newer markets, and to integrate with many global partners and local payment processors in those regions. We look forward to architecting more functionality and scaling out further.

If you would like to design and implement large-scale distributed systems for critical data and build automation/tooling for testing them, we have a couple of positions open and would love to talk to you! Check out the positions here.

Distributed Resource Scheduling with Apache Mesos


Netflix uses Apache Mesos to run a mix of batch, stream processing, and service style workloads. For over two years, we have seen increasing usage across a variety of use cases, including real-time anomaly detection, training and model-building batch jobs, machine learning orchestration, and Node.js based microservices. The recent release of Apache Mesos 1.0 reflects the maturity of a technology that has evolved significantly since we first started to experiment with it.

Our initial use of Apache Mesos was motivated by fine-grained allocation of resources to tasks of various sizes that can be bin-packed onto a single EC2 instance. In the absence of Mesos, or a similar resource manager, we would have had to forgo fine-grained allocation in favor of a larger number of instances with suboptimal usage, or develop a technology similar to Mesos, or at least a subset of it.

The increasing adoption of containers for stream processing and batch jobs continues to drive usage of Mesos-based resource scheduling. More recently, the developer benefits of working with Docker-based containers brought a set of service style workloads onto Mesos clusters. We present here an overview of some of the projects using Apache Mesos across Netflix engineering. We show the different use cases they address and how they each use the technology effectively. For further details on each of the projects, we provide links to other posts in the sections below.

Cloud native scheduling using Apache Mesos

In order to allocate resources from various EC2 instances to tasks, we need a resource manager that makes the resources available for scheduling and carries out the logistics of launching and monitoring tasks over a distributed set of EC2 instances. Apache Mesos separates resource allocation to "frameworks" that wish to use the cluster from the scheduling of those resources to tasks by the frameworks. While Mesos determines how many resources are allocated to a framework, the framework's scheduler determines which resources to assign to which tasks, and when. The schedulers are presented with a relatively simple API so they can focus on scheduling logic and on reacting to failures, which are inevitable in a distributed system. This allows users to write different schedulers that cater to various use cases, instead of Mesos having to be a single monolithic scheduler for all use cases. The diagram below from the Mesos documentation shows "Framework 1" receiving an offer from "Agent 1" and launching two tasks.
[Figure: Apache Mesos architecture, from the Mesos documentation]
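A conceptual Python sketch of that two-level split, showing a framework scheduler deciding what to launch on the offers it receives; this is not the real Mesos API, just the shape of the decision:

```python
# Conceptual sketch: Mesos hands a framework resource offers, and the
# framework's scheduler decides which pending tasks to place on them.
def schedule(offers, pending_tasks):
    """offers: [{'agent': ..., 'cpus': ..., 'mem': ...}]
       pending_tasks: [{'name': ..., 'cpus': ..., 'mem': ...}]"""
    launches, declined = [], []
    for offer in offers:
        free_cpus, free_mem = offer["cpus"], offer["mem"]
        placed = []
        for task in list(pending_tasks):
            if task["cpus"] <= free_cpus and task["mem"] <= free_mem:
                placed.append(task)
                pending_tasks.remove(task)
                free_cpus -= task["cpus"]
                free_mem -= task["mem"]
        if placed:
            launches.append((offer["agent"], placed))   # accept: launch tasks
        else:
            declined.append(offer)                      # decline unused offers
    return launches, declined

offers = [{"agent": "agent-1", "cpus": 8, "mem": 32}]
tasks = [{"name": "task-a", "cpus": 4, "mem": 8},
         {"name": "task-b", "cpus": 2, "mem": 4}]
print(schedule(offers, tasks))
```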
The Mesos community has seen multiple schedulers developed over time that cater to specific use cases and present specific APIs to their users.

Netflix runs various microservices in an elastic cloud, AWS EC2. Operating Mesos clusters in a cloud native environment required us to ensure that the schedulers can handle two aspects beyond what schedulers operating in a data center environment do: the increased ephemerality of the agents running the tasks, and the ability to autoscale the Mesos agent cluster based on demand. Also, the use cases we had in mind called for more advanced scheduling of resources than a first-fit style of assignment. For example, bin packing of tasks onto agents by their use of CPUs, memory, and network bandwidth minimizes fragmentation of resources. Bin packing also helps us free up as many agents as possible, which eases scaling down the agent cluster by terminating idle agents without terminating running tasks.

Identifying a gap in such capabilities among the existing schedulers, last year we contributed a scheduling library called Fenzo. Fenzo autoscales the agent cluster based on demand and assigns resources to tasks based on multiple scheduling objectives composed via fitness criteria and constraints. The fitness criteria and the constraints are extensible via plugins, with a few common implementations built in, such as bin packing and spreading tasks of a job across EC2 availability zones for high availability. Any Mesos framework that runs on the JVM can use the Fenzo Java library.
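Fenzo itself is a Java library with its own plugin interfaces; the Python snippet below only illustrates what a CPU bin-packing fitness score looks like conceptually, not Fenzo's actual API:

```python
# Conceptual sketch of a bin-packing fitness score: prefer agents that are
# already heavily used so idle agents can be scaled down.
def cpu_bin_packing_fitness(task_cpus, agent_used_cpus, agent_total_cpus):
    """Return a score in [0, 1]; higher means a tighter fit on this agent."""
    if task_cpus + agent_used_cpus > agent_total_cpus:
        return 0.0                                   # doesn't fit at all
    return (agent_used_cpus + task_cpus) / agent_total_cpus

agents = [{"id": "a1", "used": 1, "total": 16},
          {"id": "a2", "used": 12, "total": 16}]
task = {"cpus": 2}
best = max(agents,
           key=lambda a: cpu_bin_packing_fitness(task["cpus"], a["used"], a["total"]))
print(best["id"])   # -> a2: packing onto the busier agent keeps a1 free to terminate
```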

Mesos at Netflix

Here are three projects currently running Apache Mesos clusters.

Mantis

Mantis is a reactive stream processing platform that operates as a cloud native service with a focus on operational data streams. Mantis covers varied use cases including real-time dashboarding, alerting, anomaly detection, metric generation, and ad-hoc interactive exploration of streaming data. We created Mantis to make it easy for teams to get access to real-time events and build applications on top of them. Currently, Mantis is processing event streams of up to 8 million events per second and running hundreds of stream-processing jobs around the clock. One such job focuses on individual titles, processing fine-grained insights to figure out if, for example, there are playback issues with House of Cards, Season 4, Episode 1 on iPads in Brazil. This amounts to tracking millions of unique combinations of data all the time.

The Mantis platform comprises a master and an agent cluster. Users submit stream-processing applications as jobs that run as one or more workers on the agent cluster. The master uses the Fenzo scheduling library with Apache Mesos to optimally assign resources to a job's workers. One such assignment objective places perpetual stream-processing jobs on agents separate from those running transient interactive jobs. This helps scale down the agent cluster when the transient jobs complete. The diagram below shows the Mantis architecture. Workers from the various jobs may run on the same agent, using cgroups-based resource isolation.

[Figure: Mantis architecture]

Titus

Titus is a Docker container job management and execution platform. Initially, Titus served batch jobs that included algorithm training (similar titles for recommendations, A/B test cell analysis, etc.) as well as hourly ad-hoc reporting and analysis jobs. More recently, Titus has started to support service style jobs (Netflix microservices) that need a consistent local development experience as well as more fine-grained resource management. Titus' initial service style use is for the API re-architecture using server-side Node.js.


The above architecture diagram for Titus shows its Master using Fenzo to assign resources from Mesos agents. Titus provides tight integration into the Netflix microservices and AWS ecosystem, including integrations for service discovery, software based load balancing, monitoring, and our CI/CD pipeline, Spinnaker. The ability to write custom executors in Mesos allows us to easily tune the container runtime to fit in with the rest of the ecosystem.

Meson

Meson is a general purpose workflow orchestration and scheduling framework that was built to manage machine learning pipelines.

Meson caters to a heterogeneous mix of jobs with varying resource requirements for CPU, memory, and disk space. It supports the running of Spark jobs along with other batch jobs in a shared cluster. Tasks are resource isolated on the agents using Cgroups based isolation. The Meson scheduler evaluates readiness of tasks based on a graph and launches the ready tasks using resource offers from Mesos. Failure handling includes re-launching failed tasks as well as terminating tasks determined to have gone astray.
[Figure: Meson architecture]

The above diagram shows Meson’s architecture. The Meson team is currently working on enhancing its scheduling capabilities using the Fenzo scheduling library.  

Continuing work with Apache Mesos

As we continue to evolve Mantis, Titus, and Meson projects, Apache Mesos provides a stable, reliable, and scalable resource management platform. We engage with the Mesos community through our open source contribution, Fenzo, and by exchanging ideas at MesosCon conferences - connect with us at the upcoming MesosCon Europe 2016, or see our past sessions from 2014, 2015, and earlier this year (Lessons learned and Meson).

Our future work on these projects includes adding SLAs (service level agreements - such as disparate capacity guarantees for service and batch style jobs), security hardening of agents and containers, increasing operational efficiency and visibility, and adoption across a broader set of use cases. There are exciting projects in the pipeline across Mesos, Fenzo, and our frameworks to make these and other efforts successful.

If you are interested in helping us evolve our resource scheduling and container deployment projects, join our Container Platform or Personalization Infrastructure teams.

  • Sharma Podila, Andrew Spyker, Neeraj Joshi, Antony Arokiasamy

Netflix Billing Migration to AWS - Part III


In the billing migration blog post published a few weeks ago, we explained the overall approach employed in migrating our billing system to the cloud. In this post, we cover the database migration portion in detail. We hope that our experiences will help you as you undertake your own migrations.



Have you ever wondered about the elements that need to come together and align to get a complicated database migration right? You might ask, “What makes it complicated?”


Think of any challenge in a database migration, and pretty much all of them were present in this one:


  • Different hardware between source and target
  • OS flavours
  • Migration across heterogeneous databases
  • Multiple datacenters - Netflix data center (DC) and AWS cloud
  • Criticality of the transactional billing data
  • Selective dataset migration
  • Migration of constantly changing data, with minimal downtime


Billing, as most of you would agree, is a critical service for any company. The database is the most essential element in any migration, and getting it right determines the success or failure of the whole project. The Netflix CDE (Cloud Database Engineering) team was tasked with migrating this critical subsystem's database. The following sections describe some of the key areas we focused on in order to ensure a successful migration.

Choice of the database

Billing applications have transactions that need ACID compliance to process the payment for charged transactions. RDBMS seemed the right choice for the datastore.
[Figure: datastore options considered for the billing database]


Oracle: As the source database was Oracle, migrating to Oracle in the Cloud would avoid a cross-database migration, simplifying the coding effort and configuration setup. Our experience with Oracle in production gave us more confidence with respect to its performance and scalability. However, the licensing costs, and the technical debt required to migrate legacy data "as is", prompted us to explore other options.


AWS RDS MySQL: Ideally we would have gone with MySQL RDS as our backend, considering Amazon does a great job of managing and upgrading a relational database as a service and providing multi-AZ support for high availability. However, the main drawback of RDS was the storage limit of 6TB. Our requirement at the time was closer to 10TB.


AWS Aurora: AWS Aurora would have met the storage needs, but it was in beta at that time.


PostgreSQL: PostgreSQL is a powerful open source, object-relational database system, but we did not have much in-house expertise using PostgreSQL. In the DC, our primary backend databases were Oracle and MySQL. Moreover, choosing PostgreSQL would have eliminated the option of a seamless migration to Aurora in the future, as Aurora is based on the MySQL engine.


EC2 MySQL: EC2 MySQL was ultimately the choice for the billing use case, since there were no licensing costs and it also provided a path to a future Aurora migration. This involved setting up MySQL with the InnoDB engine on i2.8xlarge instances.

Production Database Architecture

High availability and scalability were the main drivers in designing the architecture to help the billing application withstand infrastructure failures, zone and region outages, and to do so with minimal downtime.


Using a DRBD copy in another zone for the primary master DB helped withstand zone outages and infrastructure failures like bad nodes and EBS volume failures. The synchronous replication protocol was used so that write operations on the primary node are considered complete only after both the local and remote writes have been confirmed. As a result, the loss of a single node is guaranteed to cause no data loss. This impacts write latency, but the latency stayed well within the SLAs.


Read replicas set up locally as well as cross-region not only met high availability requirements, but also helped with scalability. The read traffic from ETL jobs was diverted to the read replica, sparing the primary database from heavy ETL batch processing.


In case of a primary MySQL database failure, a failover is performed to the DRBD secondary node that was being replicated in synchronous mode. Once the secondary node takes over the primary role, the Route53 DNS entry for the database host is changed to point to the new primary. The billing application, being batch in nature, is designed to handle such downtime scenarios. Client connections do not fall back; they establish new connections that point to the new primary once the CNAME propagation is complete.
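The DNS piece of such a failover can be as small as one Route53 UPSERT; the sketch below uses boto3, with a hypothetical zone id and record name, and is not the actual failover tooling:

```python
# Sketch of the DNS portion of a failover: repoint the database CNAME at the
# promoted DRBD secondary. Zone id and record names are placeholders.
import boto3

def repoint_db_cname(hosted_zone_id, record_name, new_primary_host):
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "billing DB failover to promoted secondary",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,        # e.g. billingdb.example.internal.
                    "Type": "CNAME",
                    "TTL": 60,                  # short TTL so clients reconnect quickly
                    "ResourceRecords": [{"Value": new_primary_host}],
                },
            }],
        },
    )
```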
[Figure: production database architecture]


Choice of Migration Tool

We spent considerable time and effort in choosing the right tool for the migration. The primary success criteria for the POC were the ability to restart bulk loads, bi-directional replication, and data integrity. We focused on the following criteria while evaluating tools for the migration.


  • Restart bulk/incremental loads
  • Bi-directional replication
  • Parallelism per table
  • Data integrity
  • Error reporting during transfer
  • Ability to rollback after going live
  • Performance
  • Ease of use


GoldenGate stood out in terms of the features it offered, which aligned very well with our use case. It offered the ability to restart bulk loads in case of failures (a few tables were hundreds of GB in size), and its bi-directional replication feature provided an easy rollback path from MySQL to Oracle.


The main drawback with GoldenGate was the learning curve in understanding how the tool works. In addition, its manual configuration setup is prone to human error, which added a layer of difficulty. If there is no primary key or unique key on the source table, GoldenGate uses all columns as the supplemental logging key pair for both extracts and replicats. We found issues like duplicate data at the target in incremental loads for such tables, and decided to execute a full load during the cutover for those specific tables with no pre-defined primary or unique key. The advantages and features offered by GoldenGate far exceeded any challenges, and it was the tool of choice.

Schema Conversion and Validation

Since the source and target databases were different, with data type and data length differences, validation became a crucial step in migrating the data while keeping its integrity intact.


Data type mismatches took some time to sort out. One example: many numeric values in Oracle were defined as the NUMBER datatype for legacy reasons, and there is no equivalent type in MySQL. The NUMBER datatype in Oracle stores fixed and floating-point numbers, which was tricky. Some source tables had columns where NUMBER meant an integer, in other cases it was used for decimal values, while some had really long values of up to 38 digits. In contrast, MySQL has specific datatypes like INT, BIGINT, DECIMAL, and DOUBLE, and a BIGINT cannot safely go beyond 18 digits. One has to ensure that the correct mapping is done to reflect accurate values in MySQL.
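As an illustration of the kind of mapping rules involved, here is a small Python sketch; the thresholds are examples only, and every column still needed review by the application owners:

```python
# Sketch of mapping Oracle NUMBER(precision, scale) columns to MySQL types.
def mysql_type_for_number(precision, scale):
    """precision/scale as reported by Oracle's data dictionary (may be None)."""
    if precision is None:                 # unconstrained NUMBER: play it safe
        return "DECIMAL(38,10)"
    if scale and scale > 0:               # fixed-point values keep their scale
        return f"DECIMAL({precision},{scale})"
    if precision <= 9:
        return "INT"
    if precision <= 18:                   # BIGINT is only safe up to 18 digits
        return "BIGINT"
    return f"DECIMAL({precision},0)"      # very wide integers stay DECIMAL

print(mysql_type_for_number(5, 0))    # -> INT
print(mysql_type_for_number(22, 2))   # -> DECIMAL(22,2)
print(mysql_type_for_number(38, 0))   # -> DECIMAL(38,0)
```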


Partitioned tables needed special handling since, unlike Oracle, MySQL expects the partition key to be part of the primary key and unique keys. The target schema had to be redefined with proper partitioning keys to ensure no negative impact on application logic and queries.


Default value handling also differs between MySQL and Oracle. For columns with a NOT NULL constraint, MySQL determines an implicit default value for the column. Strict mode had to be enabled in MySQL to catch such data conversion issues; such transactions would fail and show up in the GoldenGate error logs.


Tools for schema conversion: We researched a variety of tools to assist in schema conversion as well as validation, but the default schema conversion provided by these tools did not work due to our legacy schema design. Even GoldenGate does not convert an Oracle schema to the equivalent MySQL version, but instead depends on the application owners to define the schema first. Since one of our goals with this migration was to optimize the schema, the database and application teams worked together to review the data types and did multiple iterations to catch any mismatches. In case of a mismatch, GoldenGate truncates the value to fit the MySQL datatype. To mitigate this issue, we relied heavily on data comparison tools and the GoldenGate error logs to detect mismatches in data type mapping between source and target.

Data Integrity

Once the full load completed and the incrementals caught up, another daunting task was to make sure the target copy correctly maintained data integrity. As the data types between Oracle and MySQL were different, it was not possible to have a generic wrapper script to compare hash values of the row keys to ensure accuracy. There are a few 3rd party tools which do data comparisons across databases by comparing the actual values, but the total dataset was 10 TB, which was not easy to compare. Instead, we used these tools to match a sample dataset, which helped in identifying a few discrepancies related to wrong schema mapping.


Test refreshes: One of the ways to ensure data integrity was to do the application testing on a copy of the production database. This was accomplished by scheduling database refreshes from the MySQL production database to test. Since production storage was backed by EBS, a test environment was easily created by taking EBS snapshots off the slave and doing a point-in-time recovery into test. This process was repeated several times to ensure data quality.


Sqoop jobs: ETL jobs and reporting were used to help with the data reconciliation process. Sqoop jobs pulled data out of Oracle for reporting purposes, and those jobs were also configured to run against MySQL. With continuous replication between source and target, reports were run against a specific time window on the ETLs. This helped take out the variation due to incremental loads.


Row counts were another technique used to compare the source and target and match them. This was achieved by pausing the incremental loads on the target and matching the counts on Oracle and MySQL. Row counts were also compared after the full GoldenGate load of the tables.
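A minimal sketch of that row-count comparison, written against generic DB-API connections and illustrative table names:

```python
# Sketch of the row-count comparison, run while incremental loads on the
# target are paused. Any DB-API driver (cx_Oracle, mysql-connector, ...) can
# supply the connections; table names here are trusted, illustrative values.
def count_rows(conn, table):
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

def compare_tables(oracle_conn, mysql_conn, tables):
    mismatches = []
    for table in tables:
        src, tgt = count_rows(oracle_conn, table), count_rows(mysql_conn, table)
        if src != tgt:
            mismatches.append((table, src, tgt))
    return mismatches

# Example (with real connections in hand):
# print(compare_tables(ora_conn, mysql_conn, ["INVOICE", "PAYMENT", "SUBSCRIPTION_EVENT"]))
```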

Performance Tuning

Infrastructure: The Billing application persisted data in the DC on two Oracle databases residing on very powerful machines: IBM Power 7 hardware with 32 dual-core 64-bit processors and 750GB of RAM, with terabytes of storage allocated via an SVC MCS (8G4) cluster with a 4GB/sec interface, running in a RAID 10 configuration.


One major concern with the migration was performance, as the target database was consolidated on one i2.8xlarge server with 32 vCPUs and 244 GB of RAM. The application team did a lot of tuning at the application layer to optimize the queries. With the help of Vector, the performance team was able to find bottlenecks and eliminate them by tuning specific system and kernel parameters. See the Appendix for more details.


High performance for reads and writes was achieved by using RAID 0 with EBS provisioned IOPS volumes. To get more aggregate throughput, 5 volumes of 4TB each were used instead of one big volume. This also facilitated faster snapshots and restores.


Database: One major concern using MySQL was the scale of our data and MySQL throughput during batch processing of data by the billing applications. Percona provided consulting support, and the MySQL database was tuned to perform well during and after the migration. The main trick is to have two cnf files: one used while migrating the data, tweaking parameters like innodb_log_file_size to help with bulk inserts, and a second cnf file for the real-time production application load, tweaking parameters like innodb_buffer_pool_instances to help with transactional throughput. See the Appendix for more details.


Data load: During the POC, we tested the initial table load with indexes both on and off, and decided to go with enabling all indexes before the load. The reasoning was that index creation in MySQL is single-threaded (most tables had multiple indexes), so we instead utilized GoldenGate's parallel load feature to populate the tables with indexes in a reasonable time. Foreign key constraints were enabled during the final cutover.


Another trick we learned was to match the total number of processes executing full and incremental loads to the number of cores on the instance. If the processes exceeded the number of cores, the performance of those data loads slowed down drastically, as the instance would spend a lot of time in context switches. It took around 2 weeks to populate the 10 TB target MySQL database with the full loads and have the incremental loads catch up.

Conclusion

Though the database piece is one of the most challenging aspects of any migration, what really makes a difference between success and failure is ensuring you are investing in the right approach up front, and partnering closely with the application team throughout the process. Looking back on the whole migration, it was truly a commendable effort by different teams across the organization, who came together to define the whole migration and make the migration a great success! Along with the individual and cross team coordination, it's also the great culture of freedom and responsibility which makes these challenging migrations possible without impacting business.

APPENDIX

Database Tunables for Bulk Insert


  • innodb_log_file_size - The size in bytes of each log file in a log group. Increased from the default size to support high WRITE throughput.
  • innodb_lru_scan_depth - Background operation performed once a second. If you have spare I/O capacity under a typical workload, increase the value.
  • innodb_adaptive_hash_index - Dynamically enables or disables adaptive hash indexing to improve query performance. Disabled for bulk insert.
  • innodb_flush_neighbors - Specifies whether flushing a page from the InnoDB buffer pool also flushes other dirty pages in the same extent. Turned off to spread out the write operations and improve I/O performance.
  • transaction-isolation - READ-COMMITTED: each consistent read, even within the same transaction, sets and reads its own fresh snapshot.
  • query_cache_size - Turning OFF the query cache helped in our use case.
  • innodb_doublewrite - If enabled (the default), InnoDB stores all data twice, first to the doublewrite buffer, then to the actual data files. Turned OFF during bulk insert.


Database Tunables for High Transaction throughput


  • innodb_log_file_size - The size in bytes of each log file in a log group.
  • innodb_max_dirty_pages_pct - Establishes a target for flushing activity.
  • innodb_buffer_pool_instances - The number of regions that the InnoDB buffer pool is divided into. For systems with buffer pools in the multi-gigabyte range, dividing the buffer pool into separate instances can improve concurrency by reducing contention as different threads read and write to cached pages.
  • query_cache_size - Turning OFF the query cache helped in our use case.
  • innodb_adaptive_hash_index - Dynamically enables or disables adaptive hash indexing to improve query performance. Disabled during high transaction load.
  • innodb_log_buffer_size - The size in bytes of the buffer that InnoDB uses to write to the log files on disk. Increased from the default size to support high WRITE throughput.


Storage
  • RAID 0 with 5 x 4TB EBS PIOPS volumes
  • LVM to manage two Logical Volumes (DB and DRBD metadata) within a single Volume Group.


CPU Scheduler Tunables


  • kernel.numa_balancing - Linux supports an automatic NUMA balancing feature that results in higher kernel overhead due to frequent mapping/unmapping of application memory pages. Disable it and instead use the NUMA API in the application, or the sysadmin utility 'numactl', to hint to the kernel how memory allocation should be handled.

VM Tunables
  • dirty_ratio - Throttles writes when dirty (modified) pages in the file system cache reach 40% of physical memory. Raise it to improve application write throughput.
  • swappiness - Disables Linux periodic page-out activity. Setting it to zero causes pages sitting in the file system cache to be paged out during normal operation when the application needs more memory.
  • dirty_background_ratio - Wakes up the flusher kernel thread when dirty pages reach 10% of total memory. Lowering the value (to 5%) wakes up the flusher thread earlier and thus keeps dirty page growth in check.


File System and IO Storage Tunables


  • aio-max-nr - Increases the limit on the number of AIO (asynchronous I/O) requests in the kernel.
  • rq_affinity - Allows block-layer processing of I/O completions to be scheduled on multiple CPUs instead of only the one that services the interrupt. Setting the value to 2 forces the I/O completion onto the CPU that originally issued the I/O, maximizing scalability and cache affinity by steering I/O completion to CPUs local to the application.
  • scheduler - Choice of I/O scheduler. cfq is a fair-share I/O scheduler that can be used to set quality of service for I/O submitted to storage.


  • Jyoti Shandil, Ravi Nyalakonda, Rajesh Matkar, Roopa Tangirala

Vizceral Open Source


Previously we wrote about our traffic intuition tool, Flux.  We have some announcements and updates to share about this project.  First, we have renamed the project to Vizceral.  More importantly, Vizceral is now open source!

Open Source

Vizceral transformed the way we understand and digest information about the state of traffic flowing into the Netflix control plane. We wanted to be able to intuit decisions based on the holistic state of the system. To get that, we needed a tool that gives us an intuitive understanding of the entire system at a glance. We can’t afford to be bogged down in analysis of quantitative or numerical data, or otherwise ‘parse’ the information in a typical dashboard. When we can apply an intuitive approach instead of relying on the need to parse data, we can minimize the time an outage impacts millions of members. We call the practice of building these types of systems Intuition Engineering. Vizceral is our first, flagship example of Intuition Engineering.

Here is a video of a simulation of the global view of Vizceral when moving traffic between regions.


Here is a screenshot of the same global view.

Global View.  Note that the numbers in the screenshot are example data, not real.
After proving the importance of Intuition Engineering internally on the Traffic Team at Netflix, we weighed the responsibilities of maintaining an open source project against the benefit we get from diverse input and the benefit we think we can provide to the community at large. The feedback we received after the initial blog post was overwhelmingly positive. Several individuals and companies expressed interest in the code and in contributing back to the project. This gave us a pretty strong signal of value to the community. Ultimately we decided to take the plunge and share our solution. Here are the four repos we are open sourcing:
  • vizceral: The main UI component that lets you view and interact with the graph data. 
  • vizceral-react: A react component wrapper around vizceral to make it easier to integrate the visualization into a react project. 
  • vizceral-component: A web component wrapper around vizceral to make it easier to integrate the visualization into a project using web components. 
  • vizceral-example: An example project that uses vizceral-react and sample data as a proof of concept and a jumping off point for integrating the visualization into your own data sources. 
The component takes a simple JSON definition of graph data (nodes and connections) with some metrics and handles all of the rendering.
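As a rough illustration of the shape of that input, the sketch below builds a small graph of nodes and connections with per-connection metrics. The field names are assumptions for illustration only; the vizceral-example project contains the authoritative format.

    # A hedged sketch of the kind of graph definition the component consumes:
    # nodes, connections between them, and per-connection traffic metrics.
    # The field names below are assumptions for illustration; the vizceral-example
    # repo contains the authoritative schema.
    import json

    graph = {
        "name": "us-east-1",
        "nodes": [
            {"name": "INTERNET"},
            {"name": "proxy"},
            {"name": "api"},
        ],
        "connections": [
            {"source": "INTERNET", "target": "proxy",
             "metrics": {"normal": 5000, "warning": 10, "danger": 5}},
            {"source": "proxy", "target": "api",
             "metrics": {"normal": 4800, "warning": 8, "danger": 2}},
        ],
    }

    print(json.dumps(graph, indent=2))  # the JSON payload handed to the component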

Internally at Netflix, we have a server-side service that gathers data from Atlas and our internal distributed request tracing service called Salp. This server-side service transforms the data into the format needed for the Vizceral component and updates the UI via web sockets. We separated the logic into the distinct parts of Vizceral and the server-side service so that we can reuse the visualization with any number of data sources.

Regional View

In the previous post, we discussed the global view, showing the traffic flowing into all the Netflix regions and the traffic being proxied between regions. What if you want to get more detail about a specific region?

Introducing the regional view:


Here is a screenshot of the same view.

Regional View
If we click on one of the regions, it brings us to a zoomed-in view of the microservices operating in that region. The far left side has one node which represents the 'internet', and all the connections from the internet are the entry points into the stack. We use similar concepts as in the global view, but simplified: circular nodes with connections between them, and traffic dots flowing along the connections.

We minimized the inter-node connections to a single lane of travel to minimize noise. The traffic dots represent the same thing as the global view, with yellow and red dots showing degraded and error responses between services. The nodes also can change color based on assumed health of the underlying service to give another quick focal point for where problems might exist in the system.

We tried a bunch of standard graph layout algorithms, but all of the ones we found were more focused on 'grouping close nodes' or 'not overlapping connections.' Grouping close nodes actually did us a disservice, since closeness of nodes does not mean they are dependent on one another. Connections not overlapping would be nice, but not at the expense of left-to-right flow. We built our own, very simple layout algorithm focused on a middle-weighted, left-to-right flow, with a few simple modifications. This algorithm has much room for improvement, but we were immediately happier with this layout than with any of the pre-canned options. Even with the less than perfect layout, this visualization provides a great overall picture of the traffic within a given region and a good gut feeling about the current state of the region.

If you want to look at a service in even more detail, you can hover over the node to highlight incoming and outgoing connections.

Service Highlighted
You can click on a node, and a contextual panel pops up that we can fill with any relevant information.

Context Panel for Highlighted Service

Currently, we just show a tabular view of the connections, and the list of services that make up this node, but we are adding some more detailed metrics and integrations with our other insight tooling.

If you want to dig in even further, you can double click on the node to enter the node focused view.

Focus on Service
This view allows us to really focus on traffic between the service and its upstream and downstream dependencies without being distracted by the rest of the region.

Getting Started

The easiest way to get started is to follow the setup instructions in the vizceral-example project. This will set up a fully functional project with dummy data, running on your development machine.

If you would like more information on this project, check out the following presentation. Justin Reynolds, tech lead on this project, gave a talk at Monitorama on 6/29/2016 about Vizceral that provides additional context on the how and the why.

Vizceral has proven extremely useful for us on the Traffic Team at Netflix, and we are happy to have the opportunity to share that value. Now that it is open sourced, we are looking forward to discussions about use cases, other possible integrations, and any feature/pull requests you may have.

-Intuition Engineering Team at Netflix
Justin Reynolds, Casey Rosenthal


Introducing Winston - Event driven Diagnostic and Remediation Platform

Netflix is a collection of micro services that all come together to enable the product you have come to love. Operating these micro services is also distributed across the owning teams and their engineers. We do not run a central operations team managing these individual services for availability. What we do instead is invest in tools that help Netflix engineers operate their services for high availability and resiliency. Today, we are going to talk about one such tool recently built for Netflix engineers - Winston

Problem Space

Consider a typical mid tier micro service at Netflix. It's a single purpose service hosted on AWS. It uses Jenkins for builds, Spinnaker for deployment and Atlas for monitoring. Alerts are configured on top of metrics using the Atlas stack language. Atlas supports triggering a set of predefined actions when the alert fires, namely instance level remediation (terminate instance, reboot, remove from service discovery, etc.), escalations (email, page), or publishing to SQS for further integration.
Any action beyond the small set already supported is not a first class citizen within the Atlas framework to reduce complexity and manage resiliency of Atlas itself. Let’s call any of these custom steps for diagnostics and remediation a runbook. Hosting and executing these runbooks usually took the form of
  1. An email or page to a human who has either documented these runbooks on a wiki/doc or written one off tools and scripts to code it up.
  2. A custom micro service that listens to the SQS integration point from Atlas and implements the run book.
Both of these approaches have drawbacks. Escalating to humans to have them do manual repeatable tasks is not the best use of our engineers' time. No one likes to get paged and wake up in the middle of the night to follow some documentation or kick off a script or a tool when a piece of software could have easily done it.
Building a custom micro service means that the application team now needs to take on the additional burden of keeping the availability and resiliency of that service high, build integration with Atlas or other monitoring tools in use, manage deployment life cycle, deprecation cycles for dependencies and worry about safety and security as well. Engineers would rather not deal with these ongoing infrastructure tasks just to host and execute their scripts that encapsulate their business logic.
Winston was created to help engineers achieve their runbook automation without managing the infrastructure and associated core features. And in case you are wondering, it’s named after Winston Wolfe, a character from the movie Pulp Fiction who has a “runbook” to solve problems and creates a controlled and safe environment to execute them.

Winston to the rescue

Winston provides an event driven runbook automation platform for Netflix engineers. It is designed to host and execute runbooks in response to operational events like alerts. Winston's goal is to act as Tier-1 support for developers, where they can outsource their repeatable diagnostic and remediation tasks and have them run automatically in response to events.


Customers provide the following inputs to configure Winston for their use case
  • Runbook as code (our first version supports only Python for writing runbooks; a hypothetical runbook sketch follows this list).
  • Events (like Atlas alerts) that should trigger that runbook. This can be one or many.
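To give a feel for what such a runbook might look like, here is a minimal, hypothetical Python sketch in the spirit of the Kafka broker example shown later. The run(event) entry point, event fields, and helper functions are assumptions for illustration and are not Winston's actual interface.

    # Hypothetical runbook sketch: diagnose and remediate an offline Kafka broker.
    # The run(event) entry point, event fields, and helpers are illustrative
    # assumptions, not Winston's real API.
    import subprocess

    def broker_is_reachable(host):
        # Placeholder diagnostic: a single ping standing in for real health checks.
        return subprocess.call(["ping", "-c", "1", host]) == 0

    def restart_broker(host):
        # Placeholder remediation; a real runbook would call a service API instead.
        print(f"restarting Kafka broker on {host}")

    def run(event):
        """Triggered by an operational event such as an Atlas alert."""
        broker = event.get("tags", {}).get("broker")   # hypothetical alert field
        if not broker:
            return {"status": "skipped", "reason": "no broker in alert payload"}
        if broker_is_reachable(broker):
            return {"status": "false_positive", "broker": broker}
        restart_broker(broker)
        return {"status": "remediated", "broker": broker}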
Winston in turn provides the following features that make life easier for Netflix engineers in building, managing and executing their runbooks.

Self serve interface - Winston Studio

From the get go, we aimed to make Winston a self serve tool for our engineers. To help improve usability and make it really easy to experiment and iterate, we created a simple and intuitive interface for our customers. Winston Studio is our one stop shop for onboarding new runbook automations, configuring existing ones, looking at logs from the runs in production, debugging, and managing the runbook life cycle.


Here is a snapshot of a runbook automated by the Real Time Data Infrastructure team at Netflix to troubleshoot and remediate the problem when one of their Kafka brokers is detected to be offline. As you can see in the snapshot, customers can write code to automate their runbooks, configure events that will trigger its execution, configure failure notification settings, and also manually run their automation to test changes before deploying them.
 
Winston_Studio runbook CRUD page for given runbook


Users can also look at the previous executions and individual execution details through Winston Studio as shown in the following snapshots.


Winston Studio execution list for given runbook
Winston Studio execution details

Runbook life cycle management

Winston implements a paved path for how runbooks are deployed and managed. It supports multiple versions of a given runbook, one for each environment (dev/test/prod). All runbooks given to Winston are stored in Stash, our persistent store for code. Stash supports versioning and appropriate security models, making it a good fit for storing code, which is what a runbook is. Each team gets its own isolated repository in Stash, and each environment (dev/test/prod) is represented by its own branch in the repository. Winston includes an automated promotion and deployment pipeline. Promotions are triggered manually by engineers through the Studio. Deployments get triggered every time runbooks are promoted or updated via the Studio. Runbooks get deployed to all instances of Winston in all three zones and across all four AWS regions within minutes.

HA deployment

Winston deployments are region and stack isolated. Region isolation is to handle region failures (the us-east-1 region going down should not affect executions in us-west-2). Stack isolation separates our test environment from our critical prod environment and provides an isolated space to test your runbooks before deploying to prod. We also provide a dev environment for developing and manually testing runbooks before deploying them to the test environment.


As you can see in the following diagram, we separate the compute from the persistence. We use a MongoDB replica set for data resiliency and automatic failover in case the DB primary dies. Multiple instances in the same region and environment share the same MongoDB cluster. Winston Studio is only a deployment-time dependency, not a run-time dependency, so we chose to host the Studio in a single region but run multiple instances behind a load balancer to handle instance failures.




Winston HA Deployment model

Winston Studio and Winston Deployment

You may notice that a runbook update propagation line (red arrow) appears to be missing between the S3 bucket and the Winston DEV cluster. It is not required because we have a shared file system between Winston Studio and the Winston DEV compute instances. This helps in faster iterations when you are updating and testing your runbooks multiple times through Winston Studio.


If we look at the zoomed in view of one of the Winston compute instances (shown in the following diagram), we can see that it hosts an SQS sensor to consume incoming events, a rules engine to connect events to runbooks, and action runners to execute the runbooks.



Winston compute instance

Zoomed in view of a Winston instance

Integrations

Winston has integrations with Atlas as an event source to manage the pipeline of events coming to it. For Atlas, it uses the SQS action as an integration hook. It also provides a set of outbound integration APIs for talking to the Netflix ecosystem that engineers can use as part of their runbooks if they need to. This is an opinionated set of APIs built to make automations easy to write.
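As an illustration of the inbound side of this integration, the sketch below polls an SQS queue for alert events using boto3 and hands each one to a handler. The queue URL and message layout are placeholder assumptions, not Winston's actual configuration.

    # Hedged sketch: consume alert events from SQS and dispatch them to a handler
    # (for example, a runbook trigger). The queue URL and message layout are
    # placeholder assumptions for illustration.
    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/atlas-alerts"  # placeholder

    def poll_once(handle_event):
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)       # long polling
        for msg in resp.get("Messages", []):
            handle_event(json.loads(msg["Body"]))            # e.g. trigger a runbook
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])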

Supporting technologies

While Winston serves as a great place to host orchestrator runbooks, we also need a way to execute instance level runbooks. We have built a REST based async script runner called BOLT which lives as a daemon on every AWS instance for a given app and provides a platform for hosting and executing instance level runbooks. BOLT provides an automated deployment pipeline as well for iterating over BOLT runbooks.

Usage

We have had Winston in production since early this year. So far, there are 7 teams on board with 22 unique runbooks hosted on Winston. On average, we run 15 executions an hour on Winston. Each of these executions would previously have been manual, or skipped because it required intervention from our engineers. Common patterns around usage fall into these buckets:
  • Filter false positives - Squelch alerts from Atlas using custom diagnostic steps specific to the service. This reduced pager fatigue and on call pain.
  • Diagnostics - Collect contextual information for the developer on call.
  • Remediation - Under safe conditions, apply mitigation steps to quickly resolve the issue and bring back the service to a healthy state.
  • Broker - Pass the alert event to an existing tool that can handle diagnostics and mitigation, managing protocol and data model conversion as part of the runbook.

Build vs. buy

When kick starting this project, we looked to see if we wanted to build something custom or reuse what's already there. After some prototyping, talking with peers in the industry, and analysis of different solutions in the market, we chose to go a hybrid route of reusing an open source solution and building custom software to fill the gaps. We decided to use StackStorm as our underlying engine to host and execute our runbooks. StackStorm was chosen because of the following key attributes:
  • Alignment with the problem we aimed to solve (event driven runbook automation).
  • The fact that it was open source allowed us to review code and architecture quality in detail.
  • Very pluggable architecture meant we can integrate it within Netflix environment easily.
  • Smart and responsive team backing the product.
Choosing StackStorm allowed us to quickly bootstrap without reinventing the wheel. This allowed us to focus on Netflix specific features and integrations and reduced our time to market significantly.

Moving forward

There are a lot of improvements we want to make to our product, both for increased operational resiliency and for providing more features for our customers to build on top of. Listed below are some key ones in each category.

Resiliency

  • We are actively looking at providing resource (memory/CPU) and security isolation for individual executions by utilizing container technology.
  • We want to invest in an at-least-once guarantee for events flowing through our platform. Currently, events can be abandoned under some failure scenarios.

Features

  • Polyglot - We would like to add support for additional languages for runbook authoring (Java is of special interest here).
  • More self serve features - Support one-to-many and many-to-one relationship and custom parameter mappings between events and runbooks.
  • Safety - Automated remediation steps gone haywire can cause considerable damage. We would like to look at providing safety features (e.g. rate limiting, cross-event correlation).


Our goal is to continue increasing adoption within Netflix. We aim to learn and grow the product to have bigger and better impact on availability of Netflix as well as keep Netflix engineers happy.

Summary

We talked about the need for a product like Winston at Netflix. We talked about our approach of reusing open source and building when necessary to bootstrap quickly on our needs. We went through the high level architecture, deployment model, features, and current usage of Winston.
Automated diagnostics and remediation of software is still a niche area at Netflix and in the industry. Our goal is to continue to refine a paved path in this space for Netflix and have material impact on MTTR and developer productivity. Winston is a step in that direction and provides the right platform to help engineers automate their way out of repeatable tasks. If this area and approach excites you, please reach out, we would love to talk to you about it.
By: Sayli Karmarkar & Vinay Shah on behalf of the Diagnostics and Remediation Engineering (DaRE) team


Protecting Netflix Viewing Privacy at Scale

On the Open Connect team at Netflix, we are always working to enhance the hardware and software in the purpose-built Open Connect Appliances (OCAs) that store and serve Netflix video content. As we mentioned in a recent company blog post, since the beginning of the Open Connect program we have significantly increased the efficiency of our OCAs - from delivering 8 Gbps of throughput from a single server in 2012 to over 90 Gbps from a single server in 2016. We contribute to this effort on the software side by optimizing every aspect of the software for our unique use case - in particular, focusing on the open source FreeBSD operating system and the NGINX web server that run on the OCAs.


Members of the team will be presenting a technical session on this topic at the Intel Developer Forum (IDF16) in San Francisco this month. This blog introduces some of the work we’ve done.

Adding TLS to Video Streams



In the modern internet world, we have to focus not only on efficiency, but also security. There are many state-of-the-art security mechanisms in place at Netflix, including Transport Layer Security (TLS) encryption of customer information, search queries, and other confidential data. We have always relied on pre-encoded Digital Rights Management (DRM) to secure our video streams. Over the past year, we’ve begun to use Secure HTTP (HTTP over TLS or HTTPS) to encrypt the transport of the video content as well. This helps protect member privacy, particularly when the network is insecure - ensuring that our members are safe from eavesdropping by anyone who might want to record their viewing habits.


Netflix Open Connect serves over 125 million hours of content per day, all around the world. Given our scale, adding the overhead of TLS encryption calculations to our video stream transport had the potential to greatly reduce the efficiency of our global infrastructure. We take this efficiency seriously, so we had to find creative ways to enhance the software on our OCAs to accomplish this objective.


We will describe our work in these three main areas:
  • Determining the ideal cipher for bulk encryption
  • Finding the best implementation of the chosen cipher
  • Exploring ways to improve the data path to and from the cipher implementation

Cipher Evaluation


We evaluated available and applicable ciphers and decided to primarily use the Advanced Encryption Standard (AES) cipher in Galois/Counter Mode (GCM), available starting in TLS 1.2. We chose AES-GCM over the Cipher Block Chaining (CBC) method, which comes at a higher computational cost. The AES-GCM cipher algorithm encrypts and authenticates the message simultaneously - as opposed to AES-CBC, which requires an additional pass over the data to generate keyed-hash message authentication code (HMAC). CBC can still be used as a fallback for clients that cannot support the preferred method.
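To illustrate the difference in cipher properties (not the OCA implementation itself), here is a small sketch using the Python cryptography package: AES-GCM yields ciphertext plus an authentication tag in a single operation, while the CBC path needs a separate HMAC pass over the data. The key reuse and sizes here are simplifications for brevity.

    # Illustrative only: contrast AES-GCM (encrypt and authenticate in one pass)
    # with AES-CBC plus HMAC (a separate authentication pass), using the
    # 'cryptography' package. Reusing one key for cipher and MAC is a shortcut to
    # keep the sketch small, not a recommended practice.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
    from cryptography.hazmat.primitives import hashes, hmac, padding

    key = AESGCM.generate_key(bit_length=128)
    data = b"video segment bytes ..."

    # AES-GCM: ciphertext and authentication tag come out of a single operation.
    nonce = os.urandom(12)
    ct_gcm = AESGCM(key).encrypt(nonce, data, None)

    # AES-CBC + HMAC: pad and encrypt, then make an extra pass to authenticate.
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(data) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    ct_cbc = encryptor.update(padded) + encryptor.finalize()
    mac = hmac.HMAC(key, hashes.SHA256())
    mac.update(iv + ct_cbc)
    tag = mac.finalize()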


All revisions of Open Connect Appliances also have Intel CPUs that support AES-NI, the extension to the x86 instruction set designed to improve encryption and decryption performance.
We needed to determine the best implementation of AES-GCM with the AES-NI instruction set, so we investigated alternatives to OpenSSL, including BoringSSL and the Intel Intelligent Storage Acceleration Library (ISA-L).

Additional Optimizations



Netflix and NGINX had previously worked together to improve our HTTP client request and response time via the use of sendfile calls to perform a zero-copy data flow from storage (HDD or SSD) to network socket, keeping the data in the kernel memory address space and relieving some of the CPU burden. The Netflix team specifically added the ability to make the sendfile calls asynchronous - further reducing the data path and enabling more simultaneous connections.




However, TLS functionality, which requires the data to be passed to the application layer, was incompatible with the sendfile approach.




To retain the benefits of the sendfile model while adding TLS functionality, we designed a hybrid TLS scheme whereby session management stays in the application space, but the bulk encryption is inserted into the sendfile data pipeline in the kernel. This extends sendfile to support encrypting data for TLS/SSL connections.




We also made some important fixes to our earlier data path implementation, including eliminating the need to repeatedly traverse mbuf linked lists to gain addresses for encryption.

Testing and Results



We tested the BoringSSL and ISA-L AES-GCM implementations with our sendfile improvements against a baseline of OpenSSL (with no sendfile changes), under typical Netflix traffic conditions on three different OCA hardware types. Our changes in both the BoringSSL and ISA-L test situations significantly increased both CPU utilization and bandwidth over baseline - increasing performance by up to 30%, depending on the OCA hardware version. We chose the ISA-L cipher implementation, which had slightly better results. With these improvements in place, we can continue the process of adding TLS to our video streams for clients that support it, without suffering prohibitive performance hits.


Read more details in this paper and the follow up paper. We continue to investigate new and novel approaches to making both security and performance a reality. If this kind of ground-breaking work is up your alley, check out our latest job openings!

By Randall Stewart, Scott Long, Drew Gallatin, Alex Gutarin, and Ellen Livengood

Building fast.com


On our company blog in May, we introduced fast.com, our new internet speed test. The idea behind fast.com is to provide a quick and simple way for any internet user to test their current internet speed, whether they are a Netflix member or not. Since fast.com was released, millions of internet users around the world have run the test. We have seen a lot of interest in the site and questions about how it works. This blog will give a high-level overview of how we handled some of the challenges inherent with measuring internet speeds and the technology behind fast.com.


But first, some news - we are happy to announce a new FAST mobile app, available now for Android or Apple mobile devices. Get the free app from the Apple App Store or Google Play.

Design goals



When designing the user experience for the fast.com application, we had several important goals in mind:


  • Provide accurate, consistent results that reflect users’ real-life internet use case
  • Load and run as quickly as possible
  • Provide simple results that are easy to understand
  • Work on most devices from the browser without requiring installation of a separate application


We wanted to make sure that fast.com could be easily used and understood by the majority of internet users, without requiring them to have any prior knowledge of computer networking, command line tools, and the like.

Technical goals



There are various ways to go about measuring internet speed and many variables that can impact any given measurement, some of which are not under our control - for example, the configuration of the user’s local or home network, device or router performance, other users on the network, or the TCP and network configuration on the device. However, we thought carefully about the variables that are under our control and how they would further our overall goal of a simple but meaningful test.


Variables that are under our control, and which can influence the results of the test, include things like:


  • Server location
  • Load on the server
  • Number of TCP connections used
  • Size and type of download content used
  • Methodology used to aggregate measurements


One major advantage we have is our Open Connect CDN, a globally-distributed network of servers (Open Connect Appliances or OCAs) that store and serve Netflix content to our members - representing as much as 35% of last-mile internet peak traffic in some regions. Using our own production servers to test internet speed helps to ensure that the test is a good representation of the performance that can be achieved during a real-life user scenario.


In pursuit of the design goal of simplicity, we deliberately chose to measure only download speed, measuring how fast data travels from server to consumer when they are performing activities such as viewing web pages or streaming video. Downloads represent the majority of activity for most internet consumers.


We also decided on the following high-level technical approaches:


  • To open several connections for the test, varying the number depending on the network conditions
  • To run the test against several servers from our wide network of Netflix production OCAs, but only on servers that have enough capacity to serve test traffic while simultaneously operating within acceptable parameters to deliver optimal video quality to members
  • To measure long running sessions - eliminating connection setup and ramp up time and short term variability from the result
  • To dynamically determine when to end the test so that the final results are quick, stable, and accurate
  • To run the test using HTTPS, supporting IPv4 and IPv6

Architecture



As mentioned above, fast.com downloads test files from our distributed network of Open Connect Appliances (OCAs). Each OCA server provides an endpoint with a 25MB video file. The endpoint supports a range parameter that allows requests for anywhere from 1 byte up to the full 25MB of content.
In order to steer a user to an OCA server, fast.com provides an endpoint that returns a list of several URLs for different OCAs that are best suited to run the test. To determine the list, the endpoint uses logic that is similar to the logic that is used to steer netflix.com video delivery. The OCAs that are returned are chosen based on:


  • Network distance
  • Traffic load for each OCA, which indicates overall server health
  • Network structure - each OCA in the list belongs to a different cluster




As soon as the fast.com client receives the URLs, the test begins to run.
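Conceptually, the client's job from here is to fetch ranges of that 25MB file from the returned URLs and observe how many bytes arrive per unit of time. The sketch below shows that core idea with a single HTTP range request; the use of a Range header, the chunk size, and the timing are simplified assumptions and not the actual fast.com client, which runs in the browser over multiple parallel connections.

    # Simplified sketch of the core measurement idea: download a byte range from
    # an OCA test URL and compute throughput. Not the actual fast.com client.
    import time
    import requests

    def measure_mbps(url, nbytes=2_000_000):
        headers = {"Range": f"bytes=0-{nbytes - 1}"}             # request a partial chunk
        start = time.monotonic()
        resp = requests.get(url, headers=headers)
        elapsed = time.monotonic() - start
        return (len(resp.content) * 8) / (elapsed * 1_000_000)   # megabits per second

    # test_url would come from the steering endpoint described above, e.g.:
    # print(measure_mbps(test_url))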


Estimating network speed



The test engine uses heuristics to:


  • Strip off measurements that are collected during connection setup/ramp up
  • Aggregate the rest of the collected measurements
  • Decide how many parallel connections to use during the test
  • Try to separate processing overhead from network time - because fast.com runs in the browser, it has limited visibility into the timing of network events like DNS resolution time, processing of packets on the client side, and latency to the test server
  • Make a decision about when the client has collected enough measurements to confidently present the final network speed estimate


We exclude initial connection ramp up, but we do take into account any performance drops during the test. Network performance drops might indicate a lossy network, congested link, or faulty router - therefore, excluding these drops from the test result would not correctly reflect issues experienced by users while they are consuming content from the internet.

Number of connections



Depending on network throughput, the fast.com client runs the test using a variable number of parallel connections. For low throughput networks, running more connections might result in each connection competing for very limited bandwidth, causing more timeouts and resulting in a longer and less accurate test.


When the bandwidth is high enough, however, running more parallel connections helps to saturate the network link faster and reduce test time. For very high throughput connections, especially in situations with higher latency, one connection and a 25MB file might not be enough to reach maximum speeds, so multiple connections are necessary.

Size of downloads



For each connection, the fast.com client selects the size of the chunk of the 25MB file that it wants to download. In situations where the network layer supports periodical progress events, it makes sense to request the whole file and estimate network speed using download progress counters. In cases where the download progress event is not available, the client will gradually increase payload size during the test to perform multiple downloads and get a sufficient number of samples.


Computing the results



After the download measurements are collected, the client combines the content downloaded across all connections and keeps the snapshot ('instant') speed.


The ‘instant’ network measurements are then passed to the results aggregation module. The aggregation module makes sure that:


  • We exclude initial connection ramp up
  • We compute a rolling average of the remaining measurements




One of the primary challenges for the fast.com client is determining when the estimated speed measurements are ready to be presented as a final estimate. Due to the various environments and conditions that the fast.com test can be run under, the test duration needs to be dynamic.


For stable low latency connections, we quickly see growth to full network speeds:




Higher latency connections take much longer to ramp up to full network speed:


Lossy or congested connections show significant variations in instant speed, but these instant variations get smoothed out over time. It is also harder to correctly identify the moment when connections have ramped up to full speed.




In all cases, after initial ramp up measurements are excluded, the ‘stop’ detection module monitors how the aggregated network speed is changing and makes a decision about whether the estimate is stable or if more time is needed for the test. After the results are stable, they are presented as a final estimate to the user.
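A toy version of that aggregation and stop detection might look like the sketch below: drop the ramp up samples, keep a rolling average, and declare the estimate final once consecutive windows stop moving by more than a small tolerance. The thresholds are invented for illustration and are not the values fast.com uses.

    # Toy sketch of results aggregation and 'stop' detection. The window sizes and
    # tolerance are illustrative assumptions, not fast.com's actual values.
    def estimate_speed(samples_mbps, ramp_up_skip=3, window=5, tolerance=0.02):
        """Return (estimate_mbps, is_stable) given instantaneous speed samples."""
        usable = samples_mbps[ramp_up_skip:]              # exclude connection ramp up
        if len(usable) < 2 * window:
            return (sum(usable) / len(usable) if usable else 0.0), False
        current = sum(usable[-window:]) / window          # rolling average, latest window
        previous = sum(usable[-2 * window:-window]) / window
        stable = abs(current - previous) / max(previous, 1e-9) < tolerance
        return current, stable

    # Example: feed in instant measurements until the estimate stabilizes.
    # estimate, done = estimate_speed([12.1, 48.3, 71.0, 74.2, 73.9, 74.5])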


Conclusion and Next Steps



We continue to monitor, test, and perfect fast.com, always with the goal of giving consumers the simplest and most accurate tool possible to measure their current internet performance. We plan to share updates and more details about this exciting tool in future posts.

By Sergey Fedorov and Ellen Livengood

Automated testing on devices

As part of the Netflix SDK team, our responsibility is to ensure the new release version of the Netflix application is thoroughly tested to its highest operational quality before deploying onto gaming consoles and distributing as an SDK (along with a reference application) to Netflix device partners; eventually making its way to millions of smart TVs and set top boxes (STBs). Overall, our testing is responsible for the quality of Netflix running on millions of gaming consoles and internet connected TVs/STBs.

Unlike software releases on the server side, the unique challenge with releases on devices is that there can be no red/black pushes or immediate rollbacks in case of failure. If there is a bug in the client, the cost of fixing the issue after the code has been shipped on the client device is quite high. Netflix has to re-engage with various partners whose devices might already have been certified for Netflix, kicking off the cycle again to re-certify the devices once the fix has been applied, costing engineering time both externally and internally. All the while, customers might not have a workaround to the problem, exposing them to a suboptimal Netflix experience. The most obvious way to avoid this problem is to ensure tests are conducted on devices in order to detect application regressions well before the release is shipped.

This is the first in a series of posts describing key concepts and infrastructure we use to automate functional, performance, and stress testing of the Netflix SDK on a number of devices.


Aspirational Goals

Over the years, our experience with testing the Netflix application using both manual and automated means taught us several lessons. So when the time came to redesign our automation system to go to the next level and scale up, we made sure to set these lessons as core goals.

Low setup cost / High test “agility”
Tests should not be harder to create and/or use when automation is used. In particular tests that are simple to run manually should stay simple to run in the automation. This means that using automation should have close to zero setup cost (if not none). This is important to make sure that creating new tests and debugging existing ones is both fast and painless. This also ensures the focus stays on the test and features in test as long as possible.

No test structure constraint
Using an automation system should not constrain tests to be written in a particular format. This is important in order to allow future innovation in how tests are written. Furthermore, different teams (we interact with teams responsible for platform, security, playback/media, UI, etc.) might come up with different ways to structure their tests in order to better suit their needs. Making sure the automation system is decoupled from the test structure increases its reusability.

Few layers at the test level
When building a large scale system, it is easy to end up with too many layers of abstraction. While this isn’t inherently bad in many cases, it becomes an issue when those layers are also added in the tests themselves in order to allow them to integrate with automation. Indeed the further away you are from the feature you actually test, the harder it is to debug when issues arise: so many more things outside of the application under test could have gone wrong.

In our case we test Netflix on devices, so we want to make sure that the tests run on the device itself calling to functions as close as possible to the SDK features being tested.

Support important device features
Device management consumes a lot of time when done manually and therefore is a big part of a good automation system. Since we test a product that is being developed, we need the ability to change builds on the fly and deploy them to devices. Extracting log files and crash dumps is also very important to automate in order to streamline the process of debugging test failure.

Designing automation

With these goals in place, it was clear that our team needed a system providing the necessary automation and device services while at the same time staying out of the way of testing as much as possible.

This required rethinking existing frameworks and creating a new kind of automation ecosystem. In order for automation to provide that flexibility, we needed the automation system to be lean, modular and require external services only when absolutely needed for testing a feature, that is to say only if the functionality cannot be done directly from the application on the device (for example suspend the application or manipulate the network).

Reducing the use of external services to the strict minimum has a few benefits:
  • It ensures that the logic about the test resides within the test itself as much as possible. This improves readability, maintenance and debuggability of the test. 
  • Most tests end up having no external dependencies allowing developers trying to reproduce a bug to run the test with absolutely no setup using the tools they are used to
  • The test case author can focus on testing the functionality of the device without worrying about external constraints.

At the simplest level, we needed to have two separate entities:
  • Test Framework 
    A software abstraction helping the writing of test cases by exposing functions taking care of the test flow of control.

    A test framework is about helping write tests and should be as close as possible to the device/application being tested in order to reduce the moving parts that need to be checked when debugging a test failure.

    There could be many of them so that different teams can structure their tests in a way that matches their needs.
  • Automation Services 
    A set of external backend services helping with the management of devices, automating the execution of tests, and, when absolutely required, providing external features for testing. Automation services should be built in as standalone a manner as possible. Reducing ties between services allows for better reusability, maintenance, debugging, and evolution. For example, services which aid in starting the test, collecting information about the test run, or validating test results can be delegated to individual microservices. These microservices aid in running the test but are not required to run a test. An automation service should only provide a service and should not control the test flow.

    For instance, the test can ask an external service to restart the device as part of the test flow. But the service should not dictate that the test restart the device, nor control the test flow.

Building a Plug and Play Ecosystem

When it came to designing automation services, we looked at what was needed from each of these services.
  • Device Management
    While the tests themselves are automated, conducting tests on a wide range of devices requires a number of custom steps such as flashing, upgrading, and launching the application before the test starts as well as collecting logs and crash dumps after the test ends. Each of these operations can be completely different on each device. We needed a service abstracting the device specific information and providing a common interface for different devices
  • Test Management
    Writing tests is only a small part of the story: the following must also be taken care of:
     - Organizing them in groups (test suites)
     - Choosing when to run them
     - Choosing what configuration to run them with
     - Storing their results
     - Visualizing their results
  • Network Manipulation
    Testing the Netflix application experience on a device with fluctuating bandwidth is a core requirement for ensuring high quality uninterrupted playback experience. We needed a service which could change network conditions including traffic shaping and DNS manipulation.
  • File Service
    As we start collecting builds for archival purpose or for storing huge log files, we needed a way to store and retrieve these files and file service was implemented to assist with this.
  • Test Runner
    Each service being fully independent, we needed an orchestrator that would talk to the separate services in order to get and prepare devices before tests are run and to collect results after the tests end.

With the above mentioned design choices in mind, we built the following automation system.
The services described below evolved to meet the above specified needs with the principles of being as standalone as possible and not tied into the testing framework. These concepts were put in practice as described below.


Device service
The device service abstracts the technical details required to manage a device from start to end. By exposing a simple unified RESTful interface for all types of devices, consumers of this service no longer need any device specific knowledge: they can use any and all devices as if they were the same.

The logic for managing each type of device is not directly implemented in the device service itself but instead delegated to other independent micro-services called device handlers.

This brings flexibility in adding support for new types of devices, since device handlers can be written in any programming language using their own choice of REST APIs, and existing handlers can easily be integrated with the device service. Some handlers can also require a physical connection to the device, therefore decoupling the device service from the device handlers gives flexibility in where to locate them.

For each request received, the role of the device service is to figure out which device handler to contact and proxy the request to it after having adapted it to the set of REST APIs the device handler interfaces with.

Let us look at a more concrete example of this. The action for installing a build on PS4, for example, is very different from installing a build on Roku. One relies on code written in C# interfacing with ProDG Target Manager running on Windows (for PlayStation) and the other on Node.js running on Linux. The PS4 and Roku device handlers both implement their own device specific installation procedure.
If the device service needs to talk to a device, it needs to know the device specific information. Each device, with its own unique identifier, is stored and accessible by the device service as a device map object, containing information regarding the device needed by the handler. For example:
  • Device IP or hostname 
  • Device Mac address (optional) 
  • Handler IP or hostname 
  • Handler Port 
  • Bifrost IP or hostname (Network service) 
  • Powercycle IP or hostname (remote power management service) 
The device map information is populated when adding device into our automation for the first time.

When a new device type is introduced for testing, a specific handler for that device is implemented and exposed by the device service. The device service supports the following common set of device methods:


POST /device/install
Installs the Netflix application
POST /device/start
Launches the Netflix application with a given set of launch parameters
POST /device/stop
Stops the Netflix application
POST /device/restart
Restarts the Netflix application (stop + start essentially)
POST /device/powercycle
Power-cycles the device. Either via direct or remote power boot.
GET /device/status
Retrieves information about the device (ex: running, stopped, etc…)
GET /device/crash
Collects the Netflix application crash report
GET /device/screenshot
Grabs a full screen render of the active screen
GET /device/debug
Collects debug files produced by the device

Note that each of these endpoints requires a unique device identifier to be posted with the request. This identifier (similar to a serial number) is tied to the device being operated.
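To make the interaction concrete, here is a hedged sketch of a client calling a few of these endpoints. The host name, payload field names, and device identifier are assumptions for illustration; only the endpoint paths come from the table above.

    # Hedged sketch of calling the device service. The host name, payload fields,
    # and device identifier are assumptions; only the endpoint paths come from
    # the table above.
    import requests

    DEVICE_SERVICE = "http://device-service.example.test"    # placeholder host

    def install_and_start(device_id, build_url):
        # Install a build, then launch the Netflix application on the device.
        requests.post(f"{DEVICE_SERVICE}/device/install",
                      json={"deviceId": device_id, "buildUrl": build_url}).raise_for_status()
        requests.post(f"{DEVICE_SERVICE}/device/start",
                      json={"deviceId": device_id, "launchParams": {"env": "test"}}).raise_for_status()

    def device_status(device_id):
        return requests.get(f"{DEVICE_SERVICE}/device/status",
                            params={"deviceId": device_id}).json()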

Keeping the service simple allows it to be quite extensible. Introducing additional capability for devices can be easily done, and if a device does not support the capability, it simply NOOPs it.

The device service also acts as a device pooler:

POST /device/reserve
Reserves a device and get a lease for a period of time.
PUT /device/reserve
Renew the lease of a previously reserved device
GET /device/reserve
List the devices currently reserved
POST /device/release
Release a device that was previously reserved
POST /device/disable
Temporarily blacklists the device from being used (in the event of a non-operational device or flaky health).
GET /device/disable
List the devices currently disabled

Here are some pictures of some of the devices that we are running in the lab for automation. Notice the little mechanical hand near the power button of the Xbox 360. This is a custom solution that we put together just for the Xbox 360, as this device requires a manual button press to reboot it. We decided to automate this manual process by designing a mechanical arm connected to a Raspberry Pi, which controls the hand to move and press the power button. This action was added to the Xbox 360 device handler. The powercycle endpoint of the device service calls the power cycle handler of the Xbox 360. This action is not necessary for PS3 or PS4 and is not implemented in those handlers.



Test service
The Test Service is the bookkeeper of a running test case session. Its purpose is to mark the start of a test case and then record status changes, log messages, metadata, links to files (logs/crash minidumps collected throughout the test), and data series emitted by the test case until test completion. The service exposes simple endpoints invoked by the test framework running the test case:

POST /tests/start
Marks test as started
POST /tests/end
Mark test as ended
POST /tests/configuration
Post device configuration such as version, device model, etc...
POST /tests/keepalive
A TTL health-check in the event the device goes unresponsive
POST /tests/details
Post some test data/results


A test framework will typically call these endpoints internally as follows (a minimal sketch of this sequence appears after the list):

  • Once the test has started, a call to POST /tests/start is made
  • A periodic keepalive is sent to POST /tests/keepalive to let the Test Service know that the test is in progress.
  • Test information and results are sent using POST /tests/configuration and POST /tests/details while the test is running
  • When the test ends, a call to POST /tests/end is made
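Here is a hedged sketch of that sequence from a test framework's point of view. The host name and payload fields are assumptions; only the endpoint paths come from the table above.

    # Hedged sketch of the reporting sequence above. The host name and payload
    # fields are assumptions; only the endpoint paths come from the table.
    import requests

    TEST_SERVICE = "http://test-service.example.test"    # placeholder host

    def report_test_run(test_id, device_config, run_test):
        requests.post(f"{TEST_SERVICE}/tests/start", json={"testId": test_id})
        requests.post(f"{TEST_SERVICE}/tests/configuration",
                      json={"testId": test_id, "configuration": device_config})
        for detail in run_test():                         # the test yields results as it runs
            requests.post(f"{TEST_SERVICE}/tests/keepalive", json={"testId": test_id})
            requests.post(f"{TEST_SERVICE}/tests/details",
                          json={"testId": test_id, "detail": detail})
        requests.post(f"{TEST_SERVICE}/tests/end", json={"testId": test_id})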
Network Service — Bifröst Bridge
The network system that we have built to communicate with the device and do traffic shaping or DNS manipulation is called the Bifröst Bridge. We are not altering the network topology, and we are connecting the devices directly to the main network. The Bifröst Bridge is not required to run the tests; it is only needed when the tests require network manipulation such as overriding DNS records.


File Service
As we are running tests, we can opt to collect files produced by the tests and upload them to a storage depot via the file service. These include device log files, crash reports, screen captures, etc... The service is very straightforward from a consumer client perspective:

POST /file
Uploads a file without specifying a name resulting in a unique identifier in the response that can be later used for download
GET /file/:id
Downloads a file with a given identifier


The file service is backed by cloud storage and resources are cached for fast retrieval using Varnish Cache.



Database
We have chosen to use MongoDB as the database of choice for the Test Service because of its JSON format and the schema-less aspect of it. The flexibility of having an open JSON document storage solution is key for our needs because test results and metadata storage are always constantly evolving and are never finite in their structure. While a relational database sounds quite appealing from a DB management standpoint, it obstructs the principle of Plug-and-Play as the DB schema needs to be manually kept up to date with whatever tests might want.

When running in CI mode, we record a unique run id for each test and collect information about the build configuration, device configuration, test details etc. Downloadable links to file service to logs are also stored in the database test entry.


Test Runner — Maze Runner
In order to reduce the burden on each test case owner of calling into different services and running the tests individually, we built a controller, called Maze Runner, which orchestrates running the tests and calls different services as needed.

The owner of the test suite creates a script in which he/she specifies the devices (or device types) on which the tests need to be run, the test suite name, and the test cases that form the suite, and then asks Maze Runner to execute the tests (in parallel).

Here is the list of steps that Maze Runner performs (a condensed sketch of this flow appears after the list):
  1. Finds a device/devices to run on based on what was requested
  2. Calls into the Device Service to install a build 
  3. Calls into the Device Service to start the test 
  4. Wait until the test is marked as “ended” in the Test Service 
  5. Display the result of the test retrieved using the Test Service 
  6. Collect log files using the Device Service 
  7. If the test did not start or did not end (timeout), Maze Runner checks whether the application has crashed using the Device Service. 
  8. If the crash is detected, it collects the coredump, generates call stack and runs it through a proprietary call stack classifier and detects a crash signature 
  9. Notify the Test Service if a crash or timeout occurred. 
  10. At any point during the sequence, if Maze Runner detects a device has an issue (the build won’t install or the device won’t start because it lost its network connectivity for example), it will release the device, asking the device service to disable it for some period of time and will finally get a whole new device to run the test on. The idea is that pure device failure should not impact tests.
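The sketch below condenses that sequence into Python against hypothetical service clients; the class and method names are assumptions, intended only to show the control flow, not Maze Runner's real implementation.

    # Condensed sketch of the Maze Runner flow using hypothetical service clients.
    # Class and method names are assumptions; only the overall sequence follows
    # the numbered list above.
    def run_test_on_device(device_service, test_service, device_type, build, test):
        device = device_service.reserve(device_type)              # step 1
        try:
            device_service.install(device, build)                 # step 2
            device_service.start(device, test)                    # step 3
            result = test_service.wait_for_end(test)              # steps 4-5
            logs = device_service.collect_debug(device)           # step 6
            if result.timed_out and device_service.has_crash(device):   # steps 7-8
                crash = device_service.collect_crash(device)
                test_service.record_crash(test, crash)            # step 9
            return result, logs
        finally:
            device_service.release(device)                        # step 10, simplified:
            # a fuller version would disable an unhealthy device and retry on a new one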
Test frameworks
Test frameworks are well separated from the automation services, as they run along with the tests on the devices themselves. Most tests can be run manually with no need for automation services. This was one of the core principles in the design of the system. In this case, tests are manually started and the results manually retrieved and inspected when the test is done.

However test frameworks can be made to operate with automation services (the test service for example, to store the tests progress and results). We need this integration with automation services when tests are run in CI by our runner.

In order to achieve this in a flexible way we created a single abstraction layer internally known as TPL (Test Portability Layer). Tests and test frameworks call into this layer which defines simple interfaces for each automation service. Each automation service can provide an implementation for those interfaces. 

This layer allows tests meant to be run by our automation to be executed on a completely different automation system, provided that TPL interfaces for that system's services are implemented. This enabled taking test cases written by other teams (using different automation systems) and running them unchanged. When a test is unchanged, the barrier to troubleshooting a test failure on the device by the test owner is completely eliminated; and we always want to keep it that way.
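In code, the TPL idea might look something like the following: a small interface per automation service, with an implementation supplied by whichever system runs the test. The class and method names here are hypothetical, not the actual TPL interfaces.

    # Hypothetical sketch of the Test Portability Layer idea: tests call a small
    # interface, and each automation system plugs in its own implementation.
    from abc import ABC, abstractmethod

    class TestReporter(ABC):
        """TPL-style interface the tests call into; names are illustrative."""
        @abstractmethod
        def start(self, test_id): ...
        @abstractmethod
        def detail(self, test_id, data): ...
        @abstractmethod
        def end(self, test_id, status): ...

    class ConsoleReporter(TestReporter):
        """Implementation used when a test is run manually, with no services."""
        def start(self, test_id): print(f"[start] {test_id}")
        def detail(self, test_id, data): print(f"[detail] {test_id}: {data}")
        def end(self, test_id, status): print(f"[end] {test_id}: {status}")

    # A CI implementation would instead POST to the Test Service endpoints
    # described earlier, with no change to the tests themselves.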

Progress

By keeping the test framework independent of automation services, using automation services on an as required basis and adding the missing device features we managed to:
  1. Augment our test automation coverage on gaming consoles and reference applications. 
  2. Extend the infrastructure to mobile devices (Android, iOS, and Windows Mobile). 
  3. Enable other QA departments to run their tests and automation frameworks against our device infrastructure.
Our most recent test execution coverage figures show that we execute roughly 1500 tests per build on reference applications alone. To put things in perspective, the dev team produces around 10-15 builds on a single branch per day each generating 5 different build flavors (such as Debug, Release, AddressSanitizer, etc..) for the reference application. For gaming consoles, there are about 3-4 builds produced per day with a single artifact flavor. Conservatively speaking, using a single build artifact flavor, our ecosystem is responsible for running close to 1500*10 + 1500*3 =~ 20K test cases on a given day.

New Challenges

Given the sheer number of tests executed per day, two prominent sets of challenges emerge:
  1. Device and ecosystem scalability and resiliency 
  2. Telemetry analysis overload generated by test results 

In future blog posts, we will delve deeper and talk about the wide ranging set of initiatives we are currently undertaking to address those great new challenges.

Benoit Fontaine, Janaki Ramachandran, Tim Kaddoura, Gustavo Branco

Netflix and Fill


Tomorrow we'll release another much-anticipated new series, The Get Down. Before you can hit “Play”, we have to distribute this new title to our global network of thousands of Open Connect appliances.  Fortunately, this is now a routine exercise for us, ensuring our members around the world will have access to the title whenever they choose to watch it.

In a previous company blog post, we talked about content distribution throughout our Open Connect network at a high level. In this post, we’ll dig a little deeper into the complex reality of global content distribution. New titles come onto the service, titles increase and decrease in popularity, and sometimes faulty encodes need to be rapidly fixed and replaced. All of this content needs to be positioned in the right place at the right time to provide a flawless viewing experience. So let’s take a closer look at how this works.

Title readiness

When a new piece of content is released, the digital assets that are associated with the title are handed off from the content provider to our Content Operations team. At this point, various types of processing and enhancements take place including quality control, encoding, and the addition of more assets that are required for integration into the Netflix platform. At the end of this phase, the title and its associated assets (different bitrates, subtitles, etc.) are repackaged and deployed to our Amazon Simple Storage Service (S3). Titles in S3 that are ready to be released and deployed are flagged via title metadata by the Content Operations team, and at this point Open Connect systems take over and start to deploy the title to the Open Connect Appliances (OCAs) in our network.

Proactive Caching

We deploy the majority of our updates proactively during configured fill windows. An important difference between our Open Connect CDN and other commercial CDNs is the concept of proactive caching. Because we can predict with high accuracy what our members will watch and what time of day they will watch it, we can make use of non-peak bandwidth to download most of the content updates to the OCAs in our network during these configurable time windows. By reducing disk reads (content serving) while we are performing disk writes (adding new content to the OCAs), we are able to optimize our disk efficiency by avoiding read/write contention. The predictability of off-peak traffic patterns helps with this optimization, but we still only have a finite amount of time every day to get our content pre-positioned to where it needs to be before our traffic starts to ramp up and we want to make all of the OCA capacity available for content serving.

OCA Clusters

To understand how our fill patterns work, it helps to understand how we architect OCAs into clusters, whether they are in an internet exchange point (IX) or embedded into an ISP’s network.

OCAs are grouped into manifest clusters, to distribute one or more copies of the catalog, depending on the popularity of the title. Each manifest cluster gets configured with an appropriate content region (the group of countries that are expected to stream content from the cluster), a particular popularity feed (which in simplified terms is an ordered list of titles, based on previous data about their popularity), and how many copies of the content it should hold. We compute independent popularity rankings by country, region, or other selection criteria. For those who are interested, we plan to go into more detail about popularity and content storage efficiency in future posts.

We then group our OCAs one step further into fill clusters. A fill cluster is a group of manifest clusters that have a shared content region and popularity feed. Each fill cluster is configured by the Open Connect Operations team with fill escalation policies (described below) and number of fill masters.

The following diagram shows an example of two manifest clusters that are part of the same fill cluster:
Fill Source Manifests

OCAs do not store any information about other OCAs in the network, title popularity, etc. All of this information is aggregated and stored in the AWS control plane. OCAs communicate at regular intervals with the control plane services, requesting (among other things) a manifest file that contains the list of titles they should be storing and serving to members. If there is a delta between the list of titles in the manifest and what they are currently storing, each OCA will send a request, during its configured fill window, that includes a list of the new or updated titles that it needs. The response from the control plane in AWS is a ranked list of potential download locations, aka fill sources, for each title. The determination of the list takes into consideration several high-level factors:


  • Title (content) availability - Does the fill source have the requested title stored?
  • Fill health - Can the fill source take on additional fill traffic?
  • A calculated route cost - Described in the next section.

Calculating the Least Expensive Fill Source

It would be inefficient, in terms of both time and cost, to distribute a title directly from S3 to all of our OCAs, so we use a tiered approach. The goal is to ensure that the title is passed from one part of our network to another using the most efficient route possible.

To calculate the least expensive fill source, we take into account network state and some configuration parameters for each OCA that are set by the Open Connect Operations team. For example:

  • BGP path attributes and physical location (latitude / longitude)
  • Fill master (number per fill cluster)
  • Fill escalation policies

A fill escalation policy defines:

  1. How many hops away an OCA can go to download content, and how long it should wait before doing so
  2. Whether the OCA can go to the entire Open Connect network (beyond the hops defined above), and how long it should wait before doing so
  3. Whether the OCA can go to S3, and how long it should wait before doing so

The control plane elects the specified number of OCAs as masters for a given title asset. The fill escalation policies that are applied to masters typically allow them to reach farther with less delay in order to grab that content and then share it locally with non-masters.
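
As a rough illustration, a fill escalation policy could be modeled as a small configuration object like the sketch below; the class and field names are hypothetical, not the actual Open Connect implementation.

```java
// Hypothetical sketch only -- not the actual Open Connect implementation.
// It mirrors the three escalation rules described above.
public class FillEscalationPolicy {
    int maxHops;                    // how many hops away an OCA may go to download content
    long waitBeforeHopsMillis;      // how long to wait before escalating to those hops

    boolean allowEntireNetwork;     // may it escalate to the entire Open Connect network?
    long waitBeforeNetworkMillis;   // how long to wait before doing so

    boolean allowS3;                // may it fall back to downloading directly from S3?
    long waitBeforeS3Millis;        // how long to wait before doing so
}
```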

Given all of the input to our route calculations, rank order for fill sources works generally like this:

  1. Peer fill: Available OCAs within the same manifest cluster or the same subnet
  2. Tier fill: Available OCAs outside the manifest cluster configuration
  3. Cache fill: Direct download from S3
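
To make this ordering concrete, here is a minimal sketch of how a control plane might filter and rank candidate fill sources for one OCA. All class and field names are hypothetical; this is illustrative only, not the actual Open Connect code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: not the actual Open Connect control-plane code.
public class FillSourceRanker {

    enum Tier { PEER, TIER, CACHE }   // peer fill, tier fill, cache (S3) fill

    static class Candidate {
        final String host;
        final Tier tier;          // derived from cluster topology
        final boolean hasTitle;   // title (content) availability
        final boolean healthy;    // can it take on additional fill traffic?
        final double routeCost;   // e.g. derived from BGP attributes and location

        Candidate(String host, Tier tier, boolean hasTitle, boolean healthy, double routeCost) {
            this.host = host; this.tier = tier; this.hasTitle = hasTitle;
            this.healthy = healthy; this.routeCost = routeCost;
        }
    }

    /** Returns candidates that can serve the title, cheapest tier first, then by route cost. */
    static List<Candidate> rank(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.hasTitle && c.healthy)
                .sorted(Comparator.<Candidate>comparingInt(c -> c.tier.ordinal())
                        .thenComparingDouble(c -> c.routeCost))
                .collect(Collectors.toList());
    }
}
```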

Example Scenario

In a typical scenario, a group of OCAs in a fill cluster request fill sources for a new title when their fill window starts. Assuming this title only exists in S3 at this point, one of the OCAs in the fill cluster that is elected as a fill master starts downloading the title directly from S3. The other OCAs are not given a fill source at this point, because we want to be as efficient as possible by always preferring to fill from nearby OCAs.



After the fill master OCA has completed its S3 download, it reports back to the control plane that it now has the title stored. The next time the other OCAs communicate with the control plane to request a fill source for this title, they are given the option to fill from the fill master.


When the second tier of OCAs complete their download, they report back their status, other OCAs can then fill from them, and so on. This process continues during the fill window. If there are titles being stored on an OCA that are no longer needed, they are put into a delete manifest and then deleted after a period of time that ensures we don’t interrupt any live sessions.

As the sun moves west and more members begin streaming, the fill window in this time zone ends, and the fill pattern continues as the fill window moves across other time zones - until enough of the OCAs in our global network that need to be able to serve this new title have it stored.

Title Liveness

When there are a sufficient number of clusters with enough copies of the title to serve it appropriately, the title can be considered to be live from a serving perspective. This liveness indicator, in conjunction with contractual metadata about when a new title should be released, is used by the Netflix application - so the next time you hit “Play”, you have access to the latest and greatest Netflix content.

Challenges

We are always making improvements to our fill process. The Open Connect Operations team uses internal tooling to constantly monitor our fill traffic, and alerts are set and monitored for OCAs that do not contain a threshold percentage of the catalog that they are supposed to be serving to members. When this happens, we correct the problem before the next fill cycle. We can also perform out-of-cycle “fast track” fills for new titles or other fixes that need to be deployed quickly - essentially following these same fill patterns while reducing propagation and processing times.

Now that Netflix operates in 190 countries and we have thousands of appliances embedded within many ISP networks around the world, we are even more obsessed with making sure that our OCAs get the latest content as quickly as possible while continuing to minimize bandwidth cost to our ISP partners.

More Information

For more information about Open Connect, take a look at the website.

If these kinds of large scale network and operations challenges are up your alley, check out our latest job openings!

By Michael Costello and Ellen Livengood

Distributed delay queues based on Dynomite

Netflix’s Content Platform Engineering runs a number of business processes which are driven by asynchronous orchestration of micro-services based tasks, and queues form an integral part of the orchestration layer amongst these services.   

A few examples of these processes are:
  • IMF based content ingest from our partners
  • Process of setting up new titles within Netflix
  • Content Ingest, encode and deployment to CDN

Traditionally, we have been using a Cassandra-based queue recipe along with Zookeeper for distributed locks, since Cassandra is the de facto storage engine at Netflix. However, using Cassandra for a queue-like data structure is a known anti-pattern. In addition, using a global lock on the queue while polling limits concurrency on the consumer side, since the lock ensures that only one consumer can poll from the queue at a time. This can be addressed somewhat by sharding the queue, but concurrency is still limited within each shard. As we started to build out a new orchestration engine, we looked at Dynomite for handling the task queues.

We wanted the following in the queue recipe:
  1. Distributed
  2. No external locks (e.g. Zookeeper locks)
  3. Highly concurrent
  4. At-least-once delivery semantics
  5. No strict FIFO
  6. Delayed queue (message is not taken out of the queue until some time in the future)
  7. Priorities within the shard
The queue recipe described here is used to build a message broker server that exposes various operations (push, poll, ack etc.) via REST endpoints and can potentially be exposed by other transports (e.g. gRPC).  Today, we are open sourcing the queue recipe.

Using Dynomite & Redis for building queues

Dynomite is a generic dynamo implementation that can be used with many different key-value pair storage engines. Currently, it provides support for the Redis Serialization Protocol (RESP) and Memcached write protocol. We chose Dynomite for its performance, multi-datacenter replication and high availability. Moreover, Dynomite provides sharding, and pluggable data storage engines, allowing us to scale vertically or horizontally as our data needs increase.

Why Redis?

We chose to build the queues using Redis as a storage engine for Dynomite.
  1. Redis's architecture lends itself nicely to a queuing design by providing the data structures required for building queues. Moreover, Redis's in-memory design provides superior performance (low latency).
  2. Dynomite, on top of Redis, provides high availability, peer-to-peer replication and required semantics around consistency (DC_SAFE_QUORUM) for building queues in a distributed cluster.

Queue Recipe

A queue is stored as a sorted set (ZADD, ZRANGE etc. operations) within Redis.  Redis sorts the members in a sorted set using the provided score.  When storing an element in the queue, the score is computed as a function of the message priority and timeout (for timed queues).  

Push &  Pop Using Redis Primitives

The following sequence describes the high-level operations used to push and poll messages in the system. For each queue, three Redis data structures are maintained:


  1. A Sorted Set containing the queued message IDs, ordered by score.
  2. A Hash set containing the message payloads, keyed by message ID.
  3. A Sorted Set containing the message IDs that have been consumed by a client but not yet acknowledged (the un-ack set).
Push
  • Calculate the score as a function of the message timeout (for delayed queues) and priority
  • Add the message ID to the sorted set for the queue
  • Add the message payload, keyed by message ID, to the hash set
Poll
  • Calculate the max score as the current time
  • Get messages with a score between 0 and the max
  • Add each message ID to the un-ack set and remove it from the sorted set for the queue
  • If the previous step succeeds, retrieve the message payload from the hash set by ID
Ack
  • Remove the message ID from the un-ack set
  • Remove the message payload from the hash set

Messages that are not acknowledged by the client are pushed back to the queue (at-least once semantics).
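
To make the recipe concrete, below is a minimal, single-node sketch of push, poll, and ack using plain Jedis against the three structures described above. It is illustrative only: the real dyno-queues implementation goes through Dyno/Dynomite, adds sharding and per-shard priorities, and relies on Dynomite's quorum semantics rather than the simple ZREM race check shown here; the score formula is also an assumption.

```java
import redis.clients.jedis.Jedis;

import java.util.UUID;

/**
 * A minimal, single-node sketch of the queue recipe using plain Jedis.
 * It only shows the core Redis structures: the queue sorted set, the
 * payload hash, and the un-ack sorted set.
 */
public class DelayQueueSketch {

    private final Jedis jedis = new Jedis("localhost", 6379);
    private final String queueKey;    // sorted set of message IDs, ordered by score
    private final String payloadKey;  // hash of message ID -> payload
    private final String unackKey;    // sorted set of delivered-but-unacked IDs

    public DelayQueueSketch(String queueName) {
        this.queueKey = queueName + ".queue";
        this.payloadKey = queueName + ".payload";
        this.unackKey = queueName + ".unack";
    }

    /** Score combines the earliest delivery time with a priority tie-breaker (illustrative formula). */
    private double score(long delayMillis, int priority) {
        return (System.currentTimeMillis() + delayMillis) + priority / 100.0;
    }

    public String push(String payload, long delayMillis, int priority) {
        String id = UUID.randomUUID().toString();
        jedis.hset(payloadKey, id, payload);                       // store the payload by ID
        jedis.zadd(queueKey, score(delayMillis, priority), id);    // enqueue the ID by score
        return id;
    }

    public String poll(long unackTimeoutMillis) {
        double max = System.currentTimeMillis();                   // only messages that are "due"
        for (String id : jedis.zrangeByScore(queueKey, 0, max, 0, 1)) {
            // Move the ID to the un-ack set; with plain Redis the ZREM return value
            // tells us whether we won the race for this message.
            jedis.zadd(unackKey, System.currentTimeMillis() + unackTimeoutMillis, id);
            if (jedis.zrem(queueKey, id) > 0) {
                return jedis.hget(payloadKey, id);                 // deliver the payload
            }
            jedis.zrem(unackKey, id);                              // lost the race; clean up
        }
        return null;                                               // nothing due right now
    }

    public void ack(String id) {
        jedis.zrem(unackKey, id);                                  // acknowledged: drop from the un-ack set
        jedis.hdel(payloadKey, id);                                // and drop the payload
    }
}
```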

Availability Zone / Rack Awareness

Our queue recipe was built on top of Dynomite's Java client, Dyno. Dyno provides connection pooling for persistent connections, and can be configured to be topology aware (token aware). Moreover, Dyno provides application-specific local rack (in AWS a rack is a zone, e.g. us-east-1a, us-east-1b etc.) affinity based on request routing to Dynomite nodes. A client in us-east-1a will connect to a Dynomite/Redis node in the same AZ (unless the node is not available, in which case the client will fail over). This property is exploited to shard the queues by availability zone.

Sharding

Queues are sharded based on availability zone. When pushing an element to the queue, the shard is selected via round robin, which ensures that all shards eventually stay balanced. Each shard is represented as a sorted set in Redis, with the key being a combination of the queueName and the AVAILABILITY_ZONE.
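
A small sketch of what shard selection might look like; the exact shard key format and the round-robin bookkeeping are assumptions for illustration, not the actual dyno-queues code.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: shard key construction and round-robin shard selection.
public class ShardSelector {
    private final List<String> availabilityZones;   // e.g. ["us-east-1a", "us-east-1b", "us-east-1c"]
    private final AtomicInteger counter = new AtomicInteger();

    public ShardSelector(List<String> availabilityZones) {
        this.availabilityZones = availabilityZones;
    }

    /** Round-robin over zones keeps the per-zone shards balanced over time. */
    public String nextShardKey(String queueName) {
        int i = Math.floorMod(counter.getAndIncrement(), availabilityZones.size());
        return queueName + "." + availabilityZones.get(i);   // e.g. "myQueue.us-east-1a"
    }
}
```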

Dynomite consistency

The message broker uses a Dynomite cluster with the consistency level set to DC_SAFE_QUORUM. Reads and writes are propagated synchronously to a quorum of nodes in the local data center and asynchronously to the rest. The quorum size is calculated from the nodes in the local data center and rounded down to a whole number. This consistency level ensures that all writes are acknowledged by a majority quorum.

Avoiding Global Locks





  • Each node (N1...Nn in the above diagram) has affinity to the availability zone and talks to the redis servers in that zone.
  • A Dynomite/Redis node serves only one request at a time. Dynomite can hold thousands of concurrent connections, but requests are processed by a single thread inside Redis. This ensures that when two concurrent calls are issued to poll an element from a queue, they are served sequentially by the Redis server, avoiding any local or distributed locks on the message broker side.
  • In the event of a failover, the DC_SAFE_QUORUM write ensures that no two client connections are given the same message out of a queue, because the write to the UNACK collection will only succeed for a single node for a given element. If the same element is picked up by two broker nodes (for example, when a connection to Dynomite fails over), only one will be able to add the message to the UNACK collection; the other will receive a failure and move on to peek at another message from the queue.

Queue Maintenance Considerations

Queue Rebalancing

Queue rebalancing is useful when queues become unbalanced, or when a new availability zone is added or an existing one is removed permanently.

Handling Un-Ack’ed messages

A background process monitors for the messages in the UNACK collections that are not acknowledged by a client in a given time (configurable per queue). These messages are moved back into the queue.
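
A minimal sketch of such a sweep, assuming the un-ack set is scored by each message's visibility deadline (as in the earlier sketch); this is illustrative only, not the actual dyno-queues code.

```java
import redis.clients.jedis.Jedis;

// Illustrative only: the background sweep that returns timed-out, un-acked
// messages to the queue, giving the recipe its at-least-once semantics.
public class UnackMonitor {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void requeueExpired(String queueKey, String unackKey) {
        double now = System.currentTimeMillis();
        // Un-ack entries are scored by their visibility deadline; anything with a
        // score <= now has exceeded its per-queue un-ack timeout.
        for (String id : jedis.zrangeByScore(unackKey, 0, now)) {
            jedis.zadd(queueKey, now, id);   // make it visible to consumers again
            jedis.zrem(unackKey, id);        // and remove it from the un-ack set
        }
    }
}
```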

Further extensions

Multiple consumers

A modified version can be implemented, where the consumer can “subscribe” for a message type (message type being metadata associated with a message) and a message is delivered to all the interested consumers.

Ephemeral Queues

Ephemeral queues hold messages with a specified TTL; messages are only available to consumers until the TTL expires. Once it expires, the messages are removed from the queue and are no longer visible to consumers. The recipe can be modified to add a TTL to messages, thereby creating an ephemeral queue. When adding elements to the Redis collections, they can be TTLed, and will be removed from the collections by Redis upon expiry.

Other messaging solutions considered

  1. Kafka
Kafka provides a robust messaging solution with at-least-once delivery semantics, and lends itself well to message streaming use cases. However, it makes it harder to implement the semantics around priority queues and time-based (delayed) queues, both of which are required for our primary use case. A case can be made for creating a large number of partitions in a queue to handle client usage - but then again, adding a message broker in the middle would complicate things further.
  2. SQS
Amazon SQS is a viable alternative and, depending on the use case, might be a good fit. However, SQS does not support priority queues, or time-based queues beyond a 15-minute delay.
  3. Disque
Disque is a project that aims to provide distributed queues with Redis-like semantics. At the time we started working on this project, Disque was in beta (an RC is now out).
  4. Zookeeper (or comparable) distributed locks / coordinator-based solutions
A distributed queue can be built with Cassandra or a similar backend, with Zookeeper as the global locking solution. However, Zookeeper quickly becomes the bottleneck as the number of clients grows, adding to latency. Queues are also a known anti-pattern use case for Cassandra itself.

Performance Tests

Below are some performance numbers for queues implemented using the above recipe. These numbers measure server-side latencies and do not include the network time between client and server. The Dynomite cluster, as noted above, runs with the DC_SAFE_QUORUM consistency level.

Cluster Setup

  • Dynomite: 3 x r3.2xlarge (us-east-1, us-west-2, eu-west-1)
  • Message Broker: 3 x m3.xlarge (us-east-1)
  • Publisher / Consumer: m3.large (us-east-1)


The Dynomite cluster is deployed across 3 regions, providing higher availability in case of region outages. The broker talks to the Dynomite cluster in the same region (unless the entire region fails over), as the test focuses on measuring latencies within the region. For very high availability use cases, the message broker could be deployed in multiple regions along with the Dynomite cluster.

Results



Events Per Second | Poll Latency ms (Avg / 95th / 99th) | Push Latency ms (Avg / 95th / 99th)
90                | 5.6 / 7.8 / 88                      | 1.3 / 1.3 / 2.2
180               | 2.9 / 2.4 / 12.3                    | 1.3 / 1.3 / 2.1
450               | 4.5 / 2.6 / 104                     | 1.2 / 1.5 / 2.1
1000              | 10 / 15 / 230                       | 1.8 / 3.3 / 6.3


Conclusion

We built the queue recipe based on our need for microservices orchestration. Building the recipe on top of Dynomite gives us the flexibility to port the solution to other storage engines, depending on workload needs. We think the recipe is hackable enough to support further use cases. We are releasing the recipe as open source: https://github.com/Netflix/dyno-queues.

If you like the challenges of building distributed systems and are interested in building the Netflix studio ecosystem and the content pipeline at scale, check out our job openings.

Engineering Trade-Offs and The Netflix API Re-Architecture

Netflix’s engineering culture is predicated on Freedom & Responsibility, the idea that everyone (and every team) at Netflix is entrusted with a core responsibility. Within that framework, they have the freedom to operate in whatever way best satisfies their mission. Accordingly, teams are generally responsible for all aspects of their systems: design, architecture, development, deployments, and operations. At the same time, it is inefficient to have all teams build everything that they need from scratch, given that there are often commonalities in the infrastructure needs of teams. We (like everyone else) value code reuse and consolidation where appropriate.

Given these two ideas (Freedom & Responsibility and leveragability of code), how can an individual and/or team figure out what they should optimize for themselves and what they should inherit from a centralized team? These kinds of trade-offs are pervasive in making engineering decisions, and Netflix is no exception.

The Netflix API is the service that handles the (sign-up, discovery and playback) traffic from all devices from all users. Over the last few years, the service has grown in a number of different dimensions: it’s grown in complexity, its request volume has increased, and Netflix’s subscriber base has grown as we expanded to most countries in the world. As the demands on the Netflix API continue to rise, the architecture that supports this massive responsibility is starting to approach its limits. As a result, we are working on a new architecture to position us well for the future (see a recent presentation at QCon for more details). This post explores the challenge of how, in the course of our re-architecture, we work to reconcile seemingly conflicting engineering principles: velocity and full ownership vs. maximum code reuse and consolidation.

Microservices Orchestration in the Netflix API
The Netflix API is the “front door” to the Netflix ecosystem of microservices. As requests come from devices, the API provides the logic of composing calls to all services that are required to construct a response. It gathers whatever information it needs from the backend services, in whatever order needed, formats and filters the data as necessary, and returns the response.

So, at its core, the Netflix API is an orchestration service that exposes coarse-grained APIs by composing fine-grained functionality provided by the microservices.
To make this happen, the API has at least four primary requirements: provide a flexible request protocol; map requests to one or more fine-grained API calls to backend microservices; provide a common resiliency abstraction to protect backend microservices; and create a context boundary (“buffer”) between device and backend teams.

Today, the API service exposes three categories of coarse grained APIs: non-member (sign-up, billing, free trial, etc.), discovery (recommended shows and movies, search, etc.) and playback (decisions regarding the streaming experience, licensing to ensure users can view specific content, viewing history, heartbeats for user bookmarking, etc.).

Consider an example from the playback category of APIs. Suppose a user clicks the “play” button for Stranger Things Episode 1 on their mobile phone. In order for playback to begin, the mobile phone sends a “play” request to the API. The API in turn calls several microservices under the hood. Some of these calls can be made in parallel, because they don’t depend on each other. Others have to be sequenced in a specific order. The API contains all the logic to sequence and parallelize the calls as necessary. The device, in turn, doesn’t need to know anything about the orchestration that goes on under the hood when the customer clicks “play”.
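
As a schematic illustration of this kind of orchestration (not the actual Netflix API implementation, which carries far more machinery around resiliency and fallbacks), the sketch below fans the independent calls out in parallel and sequences the dependent one; all service names here are made up.

```java
import java.util.concurrent.CompletableFuture;

// Schematic only -- not the actual Netflix API orchestration code, and all
// service names are made up. The point is that independent calls fan out in
// parallel while dependent calls are sequenced, and the API composes the result.
public class PlayOrchestrationSketch {

    // Stand-ins for calls to backend microservices.
    CompletableFuture<String> playbackDecision(String title) { return CompletableFuture.completedFuture("decision"); }
    CompletableFuture<String> license(String title)          { return CompletableFuture.completedFuture("license"); }
    CompletableFuture<String> bookmark(String user)          { return CompletableFuture.completedFuture("bookmark"); }
    CompletableFuture<String> streamUrls(String decision, String license) { return CompletableFuture.completedFuture("urls"); }

    public CompletableFuture<String> play(String user, String title) {
        // These three calls don't depend on each other, so they run in parallel.
        CompletableFuture<String> decision = playbackDecision(title);
        CompletableFuture<String> lic = license(title);
        CompletableFuture<String> mark = bookmark(user);

        // streamUrls needs both the playback decision and the license, so it is sequenced after them.
        CompletableFuture<String> urls = decision
                .thenCombine(lic, (d, l) -> new String[] { d, l })
                .thenCompose(dl -> streamUrls(dl[0], dl[1]));

        // Compose a single device-facing response once everything is ready.
        return urls.thenCombine(mark, (u, m) -> "{streams: " + u + ", bookmark: " + m + "}");
    }
}
```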



 Figure 1: Devices send requests to API, which orchestrates the ecosystem of microservices.


Playback requests, with some exceptions, map only to playback backend services. There are many more discovery and non-member dependent services than playback services, but the separation is relatively clean, with only a few services needed both for playback and non-playback requests.

This is not a new insight for us, and our organizational structure reflects this. Today, two teams, both the API and the Playback teams, contribute to the orchestration layer, with the Playback team focusing on Playback APIs. However, only the API team is responsible for the full operations of the API, including releases, 24/7 support, rollbacks, etc. While this is great for code reuse, it goes against our principle of teams owning and operating in production what they build.

With this in mind, the goals to address in the new architecture are:
  • We want each team to own and operate in production what they build. This will allow more targeted alerting, and faster MTTR.
  • Similarly, we want each team to own their own release schedule and wherever possible not have releases held up by unrelated changes.

Two competing approaches
As we look into the future, we are considering two options. In option 1 (see figure 2), the orchestration layer in the API will, for all playback requests, be a pass-through and simply send the requests on to the playback-specific orchestration layer. The playback orchestration layer would then play the role of orchestrating between all playback services. The one exception to a full pass-through model is the small set of shared services, where the orchestration layer in the API would enrich the request with whatever information the playback orchestration layer needs in order to service the request.



Figure 2: OPTION 1: Pass-through orchestration layer with playback-specific orchestration layer


Alternatively, we could simply split into two separate APIs (see figure 3).



Figure 3: OPTION 2: Separate playback and discovery/non-member APIs


Both of the approaches actually solve the challenges we set out to solve: for each option, each team will own the release cycle as well as the production operations of their own orchestration layer - a step forward in our minds. This means that the choice between the two options comes down to other factors. Below we discuss some of our considerations.

Developer Experience
The developers who use our API (i.e., Netflix’s device teams) are top priority when designing, building and supporting the new API. They will program against our API daily, and it is important for our business that their developer experience and productivity is excellent. Two of the top concerns in this area are discovery and documentation: our partner teams will need to know how to interact with the API, what parameters to pass in and what they can expect back. Another goal is flexibility: due to the complex needs we have for 1000+ device types, our API must be extremely flexible. For instance, a device may want to request a different number of videos, and different properties about them, than another device would. All of this work will be important to both playback and non-playback APIs, so how is this related to the one vs. two APIs discussion? One API facilitates more uniformity in those areas: how requests are made and composed, how the API is documented, where and how teams find out about changes or additions to the API, API versioning, tools to optimize the developer experience, etc. If we go the route of two APIs, this is all still possible, but we will have to work harder across the two teams to achieve this.

Organizational implications and shared components
The two teams are very close and collaborate effectively on the API today. However, we are keenly aware that a decision to create two APIs, owned by two separate teams, can have profound implications. Our goals would, and should, be minimal divergence between the two APIs. Developer experience, as noted above, is one of the reasons. More broadly, we want to maximize the reuse of any components that are relevant to both APIs. This also includes any orchestration mechanisms, and any tools, mechanisms, and libraries related to scalability, reliability, and resiliency. The risk is that the two APIs could drift apart over time. What would that mean? For one, it could have organizational consequences (e.g., need for more staff). We could end up in a situation where we have valued ownership of components to a degree that we have abandoned component reuse. This is not a desirable outcome for us, and we would have to be very thoughtful about any divergence between the two APIs.

Even in a world where we have a significant amount of code reuse, we recognize that the operational overhead will be higher. As noted above, the API is critical to the Netflix service functioning properly for customers. Up until now, only one of the teams has been tasked with making the system highly scalable and highly resilient, and carrying the operational burden. The team has spent years building up expertise and experience in system scale and resiliency. By creating two APIs, we would be distributing these tasks and responsibilities to both teams.

Simplicity
If one puts the organizational considerations aside, two separate APIs is simply the cleaner architecture. In option 1, if the API acts largely as a pass-through, is it worth incurring the extra hop? Every playback request that would come into the API would simply be passed along to the playback orchestration layer without providing much functional value (besides the small set of functionality needed from the shared services). If the components that we build for discovery, insights, resiliency, orchestration, etc. can be reused in both APIs, the simplicity of having a clean separation between the two APIs is appealing. Moreover, as mentioned briefly above, option 1 also requires two teams to be involved for Playback API pushes that change the interaction model, while option 2 truly separates out the deployments.


Where does all of this leave us? We realize that this decision will have long-lasting consequences. But in taking all of the above into consideration, we have also come to understand that there is no perfect solution. There is no right or wrong, only trade-offs. Our path forward is to make informed assumptions and then experiment and build based on them. In particular, we are experimenting with how much we can generalize the building blocks we have already built and are planning to build, so that they could be used in both APIs. If this proves fruitful, we will then build two APIs. Despite the challenges, we are optimistic about this path and excited about the future of our services. If you are interested in helping us tackle this and other equally interesting challenges, come join us! We are hiring for several different roles.

By Katharina Probst, Justin Becker

A Large-Scale Comparison of x264, x265, and libvpx - a Sneak Peek

With 83+ million members watching billions of hours of TV shows and movies, Netflix sends a huge amount of video bits through the Internet. As we grow globally, more of these video bits will be streamed through bandwidth-constrained cellular networks. Our team works on improving our video compression efficiency to ensure that we are good stewards of the Internet while at the same time delivering the best video quality to our members. Part of the effort is to evaluate the state-of-the-art video codecs, and adopt them if they provide substantial compression gains.

H.264/AVC is a very widely-used video compression standard on the Internet, with ubiquitous decoder support on web browsers, TVs, mobile devices, and other consumer devices. x264 is the most established open-source software encoder for H.264/AVC. HEVC is the successor to H.264/AVC and results reported from standardization showed about 50% bitrate savings for the same quality compared to H.264/AVC. x265 is an open-source HEVC encoder, originally ported from the x264 codebase. Concurrent to HEVC, Google developed VP9 as a royalty-free video compression format and released libvpx as an open-source software library for encoding VP9. YouTube reported that by encoding with VP9, they can deliver video at half the bandwidth compared to legacy codecs.

We ran a large-scale comparison of x264, x265 and libvpx to see for ourselves whether this 50% bandwidth improvement is applicable to our use case. Most codec comparisons in the past focused on evaluating what can be achieved by the bitstream syntax (using the reference software), applied settings that do not fully reflect our encoding scenario, or only covered a limited set of videos. Our goal was to assess what can be achieved by encoding with practical codecs that can be deployed to a production pipeline, on the Netflix catalog of movies and TV shows, with encoding parameters that are useful to a streaming service. We sampled 5000 12-second clips from our catalog, covering a wide range of genres and signal characteristics. With 3 codecs, 2 configurations, 3 resolutions (480p, 720p and 1080p) and 8 quality levels per configuration-resolution pair, we generated more than 200 million encoded frames. We applied six quality metrics - PSNR, PSNRMSE, SSIM, MS-SSIM, VIF and VMAF - resulting in more than half a million bitrate-quality curves. This encoding work required significant compute capacity. However, our cloud-based encoding infrastructure, which leverages unused Netflix-reserved AWS web servers dynamically, enabled us to complete the experiments in just a few weeks.

What did we learn?
Here’s a snapshot: x265 and libvpx demonstrate superior compression performance compared to x264, with bitrate savings reaching up to 50% especially at the higher resolutions. x265 outperforms libvpx for almost all resolutions and quality metrics, but the performance gap narrows (or even reverses) at 1080p.

Want to know more?
We will present our methodology and results this coming Wednesday, August 31, 8:00 am PDT at the SPIE Applications of Digital Image Processing conference, Session 7: Royalty-free Video. We will stream the whole session live on Periscope and YouTube: follow Anne for notifications or come back to this page for links to the live streams. This session will feature other interesting technical work from leaders in the field of Royalty-Free Video. We will also follow-up with a more detailed tech blog post and extend the results to include 4K encodes.

By Jan De Cock, Aditya Mavlankar, Anush Moorthy and Anne Aaron

Netflix Data Benchmark: Benchmarking Cloud Data Stores

The Netflix member experience is offered to 83+ million global members, and delivered using thousands of microservices. These services are owned by multiple teams, each having their own build and release lifecycles, generating a variety of data that is stored in different types of data store systems. The Cloud Database Engineering (CDE) team manages those data store systems, so we run benchmarks to validate updates to these systems, perform capacity planning, and test our cloud instances with multiple workloads and under different failure scenarios. We were also interested in a tool that could evaluate and compare new data store systems as they appear in the market or in the open source domain, determine their performance characteristics and limitations, and gauge whether they could be used in production for relevant use cases. For these purposes, we wrote Netflix Data Benchmark (NDBench), a pluggable cloud-enabled benchmarking tool that can be used across any data store system. NDBench provides plugin support for the major data store systems that we use -- Cassandra (Thrift and CQL), Dynomite (Redis), and Elasticsearch. It can also be extended to other client APIs.

Introduction

As Netflix runs thousands of microservices, we are not always aware of the traffic that bundled microservices may generate on our backend systems. Understanding the performance implications of new microservices on our backend systems was also a difficult task. We needed a framework that could assist us in determining the behavior of our data store systems under various workloads, maintenance operations and instance types. We wanted to be mindful of provisioning our clusters, scaling them either horizontally (by adding nodes) or vertically (by upgrading the instance types), and operating under different workloads and conditions, such as node failures, network partitions, etc.


As new data store systems appear in the market, they tend to report performance numbers for the “sweet spot”, and are usually based on optimized hardware and benchmark configurations. Being a cloud-native database team, we want to make sure that our systems can provide high availability under multiple failure scenarios, and that we are utilizing our instance resources optimally. There are many other factors that affect the performance of a database deployed in the cloud, such as instance types, workload patterns, and types of deployments (island vs global). NDBench aids in simulating the performance benchmark by mimicking several production use cases.


There were also some additional requirements; for example, as we upgrade our data store systems (such as Cassandra upgrades) we wanted to test the systems prior to deploying them in production. For systems that we develop in-house, such as Dynomite, we wanted to automate the functional test pipelines, understand the performance of Dynomite under various conditions, and under different storage engines. Hence, we wanted a workload generator that could be integrated into our pipelines prior to promoting an AWS AMI to a production-ready AMI.


We looked into various benchmark tools as well as REST-based performance tools. While some tools covered a subset of our requirements, we were interested in a tool that could achieve the following:
  • Dynamically change the benchmark configurations while the test is running, hence perform tests along with our production microservices.
  • Be able to integrate with platform cloud services such as dynamic configurations, discovery, metrics, etc.
  • Run for an infinite duration in order to introduce failure scenarios and test long running maintenances such as database repairs.
  • Provide pluggable patterns and loads.
  • Support different client APIs.
  • Deploy, manage and monitor multiple instances from a single entry point.
For these reasons, we created Netflix Data Benchmark (NDBench). We incorporated NDBench into the Netflix Open Source ecosystem by integrating it with components such as Archaius for configuration, Spectator for metrics, and Eureka for discovery service. However, we designed NDBench so that these libraries are injected, allowing the tool to be ported to other cloud environments, run locally, and at the same time satisfy our Netflix OSS ecosystem users.

NDBench Architecture

The following diagram shows the architecture of NDBench. The framework consists of three components:
  • Core: The workload generator
  • API: Allowing multiple plugins to be developed against NDBench
  • Web: The UI and the servlet context listener
We currently provide the following client plugins -- Datastax Java Driver (CQL), C* Astyanax (Thrift), Elasticsearch API, and Dyno (Jedis support). Additional plugins can be added, or a user can use dynamic scripts in Groovy to add new workloads. Each driver is just an implementation of the Driver plugin interface.
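
As a rough idea of what a driver plugin looks like, here is a hypothetical sketch of such an interface; the method names are illustrative only, and the actual NDBench plugin API lives in the GitHub repo.

```java
// Hypothetical sketch only -- not the actual NDBench plugin API.
// It illustrates the idea that each data store driver implements a small,
// uniform interface that the workload generator drives.
public interface DataStoreDriverSketch {
    void init() throws Exception;                      // set up client connections
    String readSingle(String key) throws Exception;    // one read operation against the store
    String writeSingle(String key) throws Exception;   // one write operation against the store
    void shutdown();                                   // tear down connections
}
```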

NDBench-core is the core component of NDBench, where one can further tune workload settings.


Fig. 1: NDBench Architecture


NDBench can be used from either the command line (using REST calls), or from a web-based user interface (UI).

NDBench Runner UI

Fig.2: NDBench Runner UI


A screenshot of the NDBench Runner (Web UI) is shown in Figure 2. Through this UI, a user can select a cluster, connect a driver, modify settings, set a load testing pattern (random or sliding window), and finally run the load tests. Selecting an instance while a load test is running also enables the user to view live-updating statistics, such as read/write latencies, requests per second, cache hits vs. misses, and more.

Load Properties

NDBench provides a variety of input parameters that are loaded dynamically and can dynamically change during the workload test. The following parameters can be configured on a per node basis:
  • numKeys: the sample space for the randomly generated keys
  • numValues: the sample space for the generated values
  • dataSize: the size of each value
  • numWriters/numReaders: the number of threads per NDBench node for writes/reads
  • writeEnabled/readEnabled: boolean to enable or disable writes or reads
  • writeRateLimit/readRateLimit: the number of writes per second and reads per second
  • userVariableDataSize: boolean to enable or disable randomly generated payload sizes
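
For illustration, these tunables map naturally onto a simple per-node settings object like the sketch below. The default values shown are made up for the example; in NDBench the real values are supplied dynamically through its configuration system.

```java
// Illustrative only: a plain holder for the per-node load properties listed above.
public class LoadProperties {
    int numKeys = 10_000;                  // sample space for randomly generated keys
    int numValues = 10_000;                // sample space for generated values
    int dataSize = 128;                    // size of each value, in bytes
    int numWriters = 4;                    // writer threads per NDBench node
    int numReaders = 4;                    // reader threads per NDBench node
    boolean writeEnabled = true;           // toggle writes at runtime
    boolean readEnabled = true;            // toggle reads at runtime
    int writeRateLimit = 100;              // writes per second
    int readRateLimit = 100;               // reads per second
    boolean userVariableDataSize = false;  // randomly sized payloads when true
}
```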

Types of Workload

NDBench offers pluggable load tests. Currently it offers two modes -- random traffic and sliding window traffic. The sliding window test is a more sophisticated test that can concurrently exercise data that is repetitive inside the window, thereby providing a combination of temporally local data and spatially local data. This test is important as we want to exercise both the caching layer provided by the data store system, as well as the disk’s IOPS (Input/Output Operations Per Second).
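
One simple way to generate such a traffic pattern is sketched below; this is an illustrative approach, not NDBench's actual sliding window implementation.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only -- not NDBench's actual implementation. Keys are drawn from
// a window that slides forward over the key space: keys inside the window repeat
// (exercising the data store's caching layer), while the window's steady advance
// keeps introducing new keys (exercising disk IOPS).
public class SlidingWindowKeys {
    private final int windowSize;           // how many distinct keys are "hot" at any moment
    private final long slideIntervalMillis; // the window advances by one key per interval

    public SlidingWindowKeys(int windowSize, long slideIntervalMillis) {
        this.windowSize = windowSize;
        this.slideIntervalMillis = slideIntervalMillis;
    }

    public String nextKey() {
        long windowStart = System.currentTimeMillis() / slideIntervalMillis;
        int offset = ThreadLocalRandom.current().nextInt(windowSize);
        return "key-" + (windowStart + offset);
    }
}
```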

Load Generation

Load can be generated individually for each node on the application side, or all nodes can generate reads and writes simultaneously. Moreover, NDBench provides the ability to use the “backfill” feature in order to start the workload with hot data. This helps in reducing the ramp up time of the benchmark.

NDBench at Netflix

NDBench has been widely used inside Netflix. In the following sections, we talk about some use cases in which NDBench has proven to be a useful tool.

Benchmarking Tool

A couple of months ago, we finished the Cassandra migration from version 2.0 to 2.1. Prior to starting the process, it was imperative for us to understand the performance gains that we would achieve, as well as the performance hit we would incur during the rolling upgrade of our Cassandra instances. Figures 3 and 4 below illustrate the p99 and p95 read latency differences using NDBench. In Fig. 3, we highlight the differences between Cassandra 2.0 (blue line) and 2.1 (brown line).

Fig.3: Capturing OPS and latency percentiles of Cassandra


Last year, we also migrated all our Cassandra instances from the older Red Hat 5.10 OS to Ubuntu 14.04 (Trusty Tahr). We used NDBench to measure performance under the newer operating system. In Figure 4, we showcase the three phases of the migration process by using NDBench’s long-running benchmark capability. We used rolling terminations of the Cassandra instances to update the AMIs with the new OS, and NDBench to verify that there would be no client-side impact during the migration. NDBench also allowed us to validate that the performance of the new OS was better after the migration.


Fig.4: Performance improvement from our upgrade from Red Hat 5.10 to Ubuntu 14.04

AMI Certification Process

NDBench is also part of our AMI certification process, which consists of integration tests and deployment validation. We designed pipelines in Spinnaker and integrated NDBench into them. The following figure shows the bakery-to-release lifecycle. We initially bake an AMI with Cassandra, create a Cassandra cluster, create an NDBench cluster, configure it, and run a performance test. We finally review the results, and make the decision on whether to promote an “Experimental” AMI to a “Candidate”. We use similar pipelines for Dynomite, testing out the replication functionalities with different client-side APIs. Passing the NDBench performance tests means that the AMI is ready to be used in the production environment. Similar pipelines are used across the board for other data store systems at Netflix.
Fig.5 NDBench integrated with Spinnaker pipelines


In the past, we’ve published benchmarks of Dynomite with Redis as a storage engine leveraging NDBench. In Fig. 6 we show some of the higher percentile latencies we derived from Dynomite leveraging NDBench.
Fig.6: P99 latencies for Dynomite with consistency set to DC_QUORUM with NDBench


NDBench allows us to run infinite horizon tests to identify potential memory leaks from long running processes that we develop or use in-house. At the same time, in our integration tests we introduce failure conditions, change the underlying variables of our systems, introduce CPU intensive operations (like repair/reconciliation), and determine the optimal performance based on the application requirements. Finally, our sidecars such as Priam, Dynomite-manager and Raigad perform various activities, such as multi-threaded backups to object storage systems. We want to make sure, through integration tests, that the performance of our data store systems is not affected.

Conclusion

For the last few years, NDBench has been a widely-used tool for functional, integration, and performance testing, as well as AMI validation. The ability to change the workload patterns during a test, support for different client APIs, and integration with our cloud deployments has greatly helped us in validating our data store systems. There are a number of improvements we would like to make to NDBench, both for increased usability and supporting additional features. Some of the features that we would like to work on include:
  • Performance profile management
  • Automated canary analysis
  • Dynamic load generation based on destination schemas
NDBench has proven to be extremely useful for us on the Cloud Database Engineering team at Netflix, and we are happy to have the opportunity to share that value. Therefore, we are releasing NDBench as an open source project, and are looking forward to receiving feedback, ideas, and contributions from the open source community. You can find NDBench on Github at: https://github.com/Netflix/ndbench


If you enjoy the challenges of building distributed systems and are interested in working with the Cloud Database Engineering team in solving next-generation data store problems, check out our job openings.


Authors: Vinay Chella, Ioannis Papapanagiotou, and Kunal Kundaje

Netflix OSS Meetup Recap - September 2016

Last week, we welcomed roughly 200 attendees to Netflix HQ in Los Gatos for Season 4, Episode 3 of our Netflix OSS Meetup. The meetup group was created in 2013 to discuss our various OSS projects amongst the broader community of OSS enthusiasts. This episode centered around security-focused OSS releases, and speakers included both Netflix creators of security OSS as well as community users and contributors.

We started the night with an hour of networking, Mexican food, and drinks. As we kicked off the presentations, we discussed the history of security OSS at Netflix - we first released Security Monkey in 2014, and we're closing in on our tenth security release, likely by the end of 2016. The slide below provides a comprehensive timeline of the security software we've released as Netflix OSS.



Wes Miaw of Netflix began the presentations with a discussion of MSL (Message Security Layer), a modern security protocol that addresses a number of difficult security problems. Next was Patrick Kelley, also of Netflix, who gave the crowd an overview of Repoman, an upcoming OSS release that works to right-size permissions within Amazon Web Services environments.

Next up were our external speakers. Vivian Ho and Ryan Lane of Lyft discussed their use of BLESS, an SSH Certificate Authority implemented as an AWS Lambda function. They're using it in conjunction with their OSS kmsauth to provide engineers SSH access to AWS instances. Closing the presentations was Chris Dorros of OpenDNS/Cisco. Chris talked about his contribution to Lemur, the SSL/TLS certificate management system we open sourced last year. Chris has added functionality to support the DigiCert Certificate Authority. After the presentations, the crowd moved back to the cafeteria, where we'd set up demo stations for a variety of our security OSS releases.

Patrick Kelley talking about Repoman


Thanks to everyone who attended - we're planning the next meetup for early December 2016. Join our group for notifications. If you weren't able to attend, we have both the slides and video available.

Upcoming Talks from the Netflix Security Team

Below is a schedule of upcoming presentations from members of the Netflix security team (through 2016). If you'd like to hear more talks from Netflix security, some of our past presentations are available on our YouTube channel.



  • Automacon (Portland, OR) - Sept 27-29, 2016
  • Scott Behrens and Andy Hoernecke - AppSecUSA 2016 (DC) - Oct 11-14, 2016
  • Scott Behrens and Andy Hoernecke - O'Reilly Security NYC (NYC) - Oct 30-Nov 2, 2016
  • Ping Identify SF (San Francisco) - Nov 2, 2016: Co-Keynote
  • QConSF (San Francisco) - Nov 7-11, 2016: The Psychology of Security Automation
  • Manish Mehta - AWS RE:invent (Las Vegas) - Nov 28-Dec 2, 2016: Solving the First Secret Problem: Securely Establishing Identity using the AWS Metadata Service
  • AWS RE:invent (Las Vegas) - Nov 28-Dec 2, 2016

If you're interested in solving interesting security problems while developing OSS that the rest of the world can use, we'd love to hear from you! Please see our jobs site for openings.

By Jason Chan


IMF: AN OPEN STANDARD WITH OPEN TOOLS


Why IMF?


As Netflix expanded into a global entertainment platform, our supply chain needed an efficient way to vault our masters in the cloud that didn’t require a different version for every territory in which we have our service.  A few years ago we discovered the Interoperable Master Format (IMF), a standard created by the Society of Motion Picture and Television Engineers (SMPTE). The IMF framework is based on the Digital Cinema standard of component based elements in a standard container with assets being mapped together via metadata instructions.  By using this standard, Netflix is able to hold a single set of core assets and the unique elements needed to make those assets relevant in a local territory.  So for a title like Narcos, where the video is largely the same in all territories, we can hold the Primary AV and the specific frames that are different for, say, the Japanese title sequence version.  This reduces duplication of assets that are 95% the same and allows us to hold that 95% once and piece it to the 5% differences needed for a specific use case.   The format also serves to minimize the risk of multiple versions being introduced into our vault, and allows us to keep better track of our assets, as they stay within one contained package, even when new elements are introduced.  This allows us to avoid “versionitis” as outlined in this previous blog.  We can leverage one set of master assets and utilize supplemental or additional master assets in IMF to make our localized language versions, as well as any transcoded versions, without needing to store anything more than master materials.  Primary AV, supplemental AV, subtitles, non-English audio and other assets needed for global distribution can all live in an “uber” master that can be continually added to as needed rather than recreated.  When a “virtual-version” is needed, the instructions simply need to be created, not the whole master.  IMF provides maximum flexibility without having to actually create every permutation of a master.  


OSS for IMF:


Netflix has a history of identifying shared problems within industries and seeking solutions via open source tools. Because many of our content partners have the same issues Netflix has with regard to global versions of their content, we saw IMF as a shared opportunity in the digital supply chain space. In order to support IMF interoperability and share the benefits of the format with the rest of the content community, we have invested in several open source IMF tools. One example is the IMF Transform Tool, which gives users the ability to transcode from IMF to DPP (Digital Production Partnership). Realizing Netflix is only one recipient of assets from content owners, we wanted to create a solution that would allow them to enjoy the benefits of IMF and still create deliverables to existing outlets. Similarly, Netflix understands the EST business is still important to content owners, so we're adding another open source transform tool that will go from IMF to an iTunes-compatible package (when using the Apple ProRes encoder). This will allow users to take a SMPTE-compliant IMF and convert it to a package which can be used for TVOD delivery without incurring significant costs via proprietary tools. A final shared problem is editing those sets of instructions we mentioned earlier. There are many great tools in the marketplace that create IMF packages, and while they are fully featured and offer powerful solutions for creating IMFs, they can be overkill for making quick changes to a CPL (Composition Playlist). Things like adding metadata markers, EIDR numbers, or other changes to the instructions for that IMF can all be done in our newly released OSS IMF CPL Editor. This leaves the fully featured commercial software/hardware tools available in facilities for IMF creation, not tied up making small changes to metadata.


IMF Transforms

The IMF Transform uses other open source technologies from Java, ffmpeg, bmxlib and x.264 in the framework.  These tools and their source code can be found on GitHub at


IMF CPL Editor


The IMF CPL Editor is cross-platform and can be compiled on Mac, Windows and/or Linux operating systems. The tool will open a composition playlist (CPL) in a timeline and list all assets. Supported essence files include .mxf-wrapped .wav, .ttml and .imsc files. The user can add, edit and delete audio, subtitle and metadata assets from the timeline. The edits can be saved back to the existing CPL or saved as a new CPL, modifying the Packing List (PKL) and Asset Map as well. The source code and compiled tool are open source and available at https://github.com/IMFTool.


What’s Next:


We hope others will branch these open source efforts and make even more functions available to the growing community of IMF users. It would be great to see a transform function to other AS-11 formats, XDCAM 50 or other widely used broadcast "play-out" formats. In addition to the base package functionality that currently exists, Netflix will be adding supplemental package support to the IMF CPL Editor in October. We look forward to seeing what developers create. These solutions, coupled with the Photon tool Netflix has already released, create a strong foundation to make an efficient and comprehensive IMF library an achievable goal for content owners seeking to exploit their assets in the global entertainment market.

By: Chris Fetner and Brian Kenworthy

Zuul 2 : The Netflix Journey to Asynchronous, Non-Blocking Systems

We recently made a major architectural change to Zuul, our cloud gateway. Did anyone even notice!?  Probably not... Zuul 2 does the same thing that its predecessor did -- acting as the front door to Netflix’s server infrastructure, handling traffic from all Netflix users around the world.  It also routes requests, supports developers’ testing and debugging, provides deep insight into our overall service health, protects Netflix from attacks, and channels traffic to other cloud regions when an AWS region is in trouble. The major architectural difference between Zuul 2 and the original is that Zuul 2 is running on an asynchronous and non-blocking framework, using Netty.  After running in production for the last several months, the primary advantage (one that we expected when embarking on this work) is that it provides the capability for devices and web browsers to have persistent connections back to Netflix at Netflix scale.  With more than 83 million members, each with multiple connected devices, this is a massive scale challenge.  By having a persistent connection to our cloud infrastructure, we can enable lots of interesting product features and innovations, reduce overall device requests, improve device performance, and understand and debug the customer experience better.  We also hoped the Zuul 2 would offer resiliency benefits and performance improvements, in terms of latencies, throughput, and costs.  But as you will learn in this post, our aspirations have differed from the results.

Differences Between Blocking vs. Non-Blocking Systems

To understand why we built Zuul 2, you must first understand the architectural differences between asynchronous and non-blocking (“async”) systems vs. multithreaded, blocking (“blocking”) systems, both in theory and in practice.  

Zuul 1 was built on the Servlet framework. Such systems are blocking and multithreaded, which means they process requests by using one thread per connection. I/O operations are done by choosing a worker thread from a thread pool to execute the I/O, and the request thread is blocked until the worker thread completes. The worker thread notifies the request thread when its work is complete. This works well with modern multi-core AWS instances handling 100’s of concurrent connections each. But when things go wrong, like backend latency increases or device retries due to errors, the count of active connections and threads increases. When this happens, nodes get into trouble and can go into a death spiral where backed up threads spike server loads and overwhelm the cluster.  To offset these risks, we built in throttling mechanisms and libraries (e.g., Hystrix) to help keep our blocking systems stable during these events.


Multithreaded System Architecture
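
For contrast, here is a generic thread-per-connection sketch (not Zuul code): each accepted connection occupies a worker thread for as long as its I/O takes.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Generic illustration, not Zuul code: the blocking model dedicates a worker
// thread to each connection for the duration of its request.
public class BlockingServerSketch {
    public static void main(String[] args) throws IOException {
        ExecutorService workers = Executors.newFixedThreadPool(200);  // active connections are capped by threads
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket socket = server.accept();             // one connection...
                workers.submit(() -> handle(socket));        // ...one thread, blocked while it does I/O
            }
        }
    }

    static void handle(Socket socket) {
        try (Socket s = socket) {
            // A real server would read the request here; the thread blocks on that I/O.
            s.getOutputStream().write("HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n".getBytes());
        } catch (IOException ignored) {
            // connection reset, client went away, etc.
        }
    }
}
```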

Async systems operate differently, with generally one thread per CPU core handling all requests and responses. The lifecycle of the request and response is handled through events and callbacks. Because there is not a thread for each request, the cost of a connection is cheap: just a file descriptor and the addition of a listener. In the blocking model, by contrast, the cost of a connection is a thread, with heavy memory and system overhead. There are some efficiency gains because data stays on the same CPU, making better use of CPU-level caches and requiring fewer context switches. The fallout of backend latency and “retry storms” (customers and devices retrying requests when problems occur) is also less stressful on the system because connections and increased events in the queue are far less expensive than piling up threads.


Asynchronous and Non-blocking System Architecture
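
And here is a generic Netty-based sketch of the async model (again, not Zuul 2 code): a small, fixed set of event-loop threads services all connections through callbacks instead of blocking a dedicated thread per connection.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

// Generic Netty illustration, not Zuul 2 code: a handful of event-loop threads
// service all connections; handlers react to I/O events via callbacks.
public class AsyncServerSketch {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);     // accepts connections
        EventLoopGroup workers = new NioEventLoopGroup();   // roughly one thread per core handles all I/O
        try {
            ServerBootstrap b = new ServerBootstrap()
                    .group(boss, workers)
                    .channel(NioServerSocketChannel.class)
                    .childHandler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            ch.pipeline().addLast(new ChannelInboundHandlerAdapter() {
                                @Override
                                public void channelRead(ChannelHandlerContext ctx, Object msg) {
                                    ctx.writeAndFlush(msg);   // echo: purely event-driven, never blocks
                                }
                            });
                        }
                    });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```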

The advantages of async systems sound glorious, but the above benefits come at a cost to operations. Blocking systems are easy to grok and debug. A thread is always doing a single operation, so the thread’s stack is an accurate snapshot of the progress of a request or spawned task, and a thread dump can be read to follow a request spanning multiple threads by following locks. An exception thrown just pops up the stack. A “catch-all” exception handler can clean up everything that isn’t explicitly caught.

Async, by contrast, is callback based and driven by an event loop. The event loop’s stack trace is meaningless when trying to follow a request. It is difficult to follow a request as events and callbacks are processed, and the tools to help with debugging this are sorely lacking in this area. Edge cases, unhandled exceptions, and incorrectly handled state changes create dangling resources resulting in ByteBuf leaks, file descriptor leaks, lost responses, etc. These types of issues have proven to be quite difficult to debug because it is difficult to know which event wasn’t handled properly or cleaned up appropriately.


Building Non-Blocking Zuul

Building Zuul 2 within Netflix’s infrastructure was more challenging than expected. Many services within the Netflix ecosystem were built with an assumption of blocking.  Netflix’s core networking libraries are also built with blocking architectural assumptions; many libraries rely on thread local variables to build up and store context about a request. Thread local variables don’t work in an async non-blocking world where multiple requests are processed on the same thread.  Consequently, much of the complexity of building Zuul 2 was in teasing out dark corners where thread local variables were being used. Other challenges involved converting blocking networking logic into non-blocking networking code, and finding blocking code deep inside libraries, fixing resource leaks, and converting core infrastructure to run asynchronously.  There is no one-size-fits-all strategy for converting blocking network logic to async; they must be individually analyzed and refactored. The same applies to core Netflix libraries, where some code was modified and some had to be forked and refactored to work with async.  The open source project Reactive-Audit was helpful by instrumenting our servers to discover cases where code blocks and libraries were blocking.

We took an interesting approach to building Zuul 2. Because blocking systems can run code asynchronously, we started by first changing our Zuul Filters and filter chaining code to run asynchronously. Zuul Filters contain the specific logic that we create to do our gateway functions (routing, logging, reverse proxying, DDoS prevention, etc). We refactored core Zuul, the base Zuul Filter classes, and our Zuul Filters using RxJava to allow them to run asynchronously. We now have two types of filters that are used together: async filters used for I/O operations, and sync filters that run logical operations that don’t require I/O. Async Zuul Filters allowed us to execute the exact same filter logic in both a blocking system and a non-blocking system. This gave us the ability to work with one filter set so that we could develop gateway features for our partners while also developing the Netty-based architecture in a single codebase. With async Zuul Filters in place, building Zuul 2 was “just” a matter of making the rest of our Zuul infrastructure run asynchronously and non-blocking. The same Zuul Filters could just drop into both architectures.
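
The shapes below are hypothetical (not the actual Zuul filter API), but they illustrate the distinction: a sync filter returns its result directly, while an async filter returns an RxJava Observable so that I/O can complete later without blocking a thread, and the two styles compose in the same chain.

```java
import rx.Observable;

// Hypothetical shapes only -- not the actual Zuul filter API.
public class FilterSketches {

    /** Sync style: pure logic, no I/O, safe to run inline. */
    interface SyncFilterSketch<T> {
        T apply(T request);
    }

    /** Async style: wraps I/O; the framework subscribes and continues the chain on completion. */
    interface AsyncFilterSketch<T> {
        Observable<T> applyAsync(T request);
    }

    static final SyncFilterSketch<String> addHeader =
            request -> request + " +routed-by-sketch";        // logical operation, no I/O

    static final AsyncFilterSketch<String> lookupOrigin =
            request -> Observable.just(request)               // stand-in for a non-blocking service call
                    .map(r -> r + " +origin-resolved");

    public static void main(String[] args) {
        lookupOrigin.applyAsync("GET /play")
                .map(addHeader::apply)                        // the same filter logic composes in either model
                .subscribe(System.out::println);
    }
}
```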


Results of Zuul 2 in Production

Hypotheses varied greatly on the benefits of an async architecture for our gateway. Some thought we would see an order-of-magnitude increase in efficiency due to the reduction of context switching and more efficient use of CPU caches, while others expected no efficiency gain at all. Opinions also varied on the complexity of the change and the development effort.

So what did we gain from this architectural change? And was it worth it? This topic is hotly debated. The Cloud Gateway team pioneered the effort to create and test async-based services at Netflix. There was a lot of interest in understanding how microservices built on async would operate at Netflix, and Zuul looked like an ideal service for demonstrating the benefits.


While we did not see a significant efficiency benefit in migrating to async and non-blocking, we did achieve our connection-scaling goals. Zuul greatly decreases the cost of network connections, which will enable push and bi-directional communication to and from devices. These features will enable more real-time user experience innovations and will reduce overall cloud costs by replacing today's "chatty" device protocols (which account for a significant portion of API traffic) with push notifications. There is also some resiliency advantage: the async model handles retry storms and latency from origin systems better than the blocking model. We are continuing to improve in this area; however, it should be noted that the resiliency advantages have not come without effort and tuning.


With the ability to drop Zuul’s core business logic into either blocking or async architectures, we have an interesting apples-to-apples comparison of blocking to async.  So how do two systems doing the exact same real work, although in very different ways, compare in terms of features, performance and resiliency?  After running Zuul 2 in production for the last several months, our evaluation is that the more CPU-bound a system is, the less of an efficiency gain we see.  


We have several different Zuul clusters that front origin services like API, playback, website, and logging. Each origin service demands that different operations be handled by the corresponding Zuul cluster.  The Zuul cluster that fronts our API service, for example, does the most on-box work of all our clusters, including metrics calculations, logging, and decrypting incoming payloads and compressing responses.  We see no efficiency gain by swapping an async Zuul 2 for a blocking one for this cluster.  From a capacity and CPU point of view they are essentially equivalent, which makes sense given how CPU-intensive the Zuul service fronting API is. They also tend to degrade at about the same throughput per node. 


The Zuul cluster that fronts our Logging services has a different performance profile. Zuul is generally receiving logging and analytics messages from devices and is write-heavy, so requests are large, but responses are small and not encrypted by Zuul.  As a result, Zuul is doing much less work for this cluster.  While still CPU-bound, we see about a 25% increase in throughput corresponding with a 25% reduction in CPU utilization by running Netty-based Zuul.  We thus observed that the less work a system actually does, the more efficiency we gain from async. 


Overall, the value we get from this architectural change is high, with connection scaling being the primary benefit, but it does come at a cost. We have a system that is much more complex to debug, code, and test, and we are working within an ecosystem at Netflix that operates on an assumption of blocking systems. It is unlikely that the ecosystem will change anytime soon, so as we add and integrate more features to our gateway we will likely need to continue teasing out thread local variables and other assumptions of blocking in client libraries and other supporting code, and to rewrite blocking calls asynchronously. This is an engineering challenge unique to working with a well-established platform and body of code that assumes blocking. Building and integrating Zuul 2 in a greenfield environment would have avoided some of these complexities, but we operate in an environment where these libraries and services are essential to the functionality of our gateway and its operation within Netflix's ecosystem.


We are in the process of releasing Zuul 2 as open source. Once it is released, we'd love to hear about your experiences with it and hope you will share your contributions! We plan on adding new features such as HTTP/2 and WebSocket support to Zuul 2 so that the community can also benefit from these innovations.


- The Cloud Gateway Team (Mikey Cohen, Mike Smith, Susheel Aroskar, Arthur Gonigberg, Gayathri Varadarajan, and Sudheer Vinukonda)




To Be Continued: Helping you find shows to continue watching on Netflix


Introduction

Our objective in improving the Netflix recommendation system is to create a personalized experience that makes it easier for our members to find great content to enjoy. The ultimate goal of our recommendation system is to know the exact perfect show for the member and just start playing it when they open Netflix. While we still have a long way to go to achieve that goal, there are areas where we can significantly reduce the gap.

When a member opens the Netflix website or app, she may be looking to discover a new movie or TV show that she has never watched before, or, alternatively, she may want to continue watching a partially-watched movie or a TV show she has been bingeing on. If we can reasonably predict when a member is more likely to be in the continuation mode and which shows she is more likely to resume, it makes sense to place those shows in prominent places on the home page.
While most recommendation work focuses on discovery, in this post we focus on the continuation mode and explain how we used machine learning to improve the member experience for both modes. In particular, we focus on a row called "Continue Watching" (CW) that appears on the Netflix homepage on most platforms. This row serves as an easy way to find shows that the member has recently (partially) watched and may want to resume. As you can imagine, a significant proportion of member streaming hours are spent on content played from this row.


Continue Watching

Previously, the Netflix app on some platforms displayed a row of recently watched shows (here we use the term show broadly to include all forms of video content on Netflix, including movies and TV series) sorted by how recently each show was played. How the row was placed on the page was determined by rules that depended on the device type. For example, the website only displayed a single continuation show in the top-left corner of the page. While these are reasonable baselines, we set out to unify the member experience of the CW row across platforms and improve it along two dimensions:

  • Improve the placement of the row on the page by placing it higher when a member is more likely to resume a show (continuation mode), and lower when a member is more likely to look for a new show to watch (discovery mode)
  • Improve the ordering of recently-watched shows in the row using their likelihood to be resumed in the current session


Intuitively, there are a number of activity patterns that might indicate a member's likelihood to be in the continuation mode. For example, a member is more likely to resume a show if she:

  • is in the middle of a binge; i.e., has been recently spending a significant amount of time watching a TV show, but hasn’t yet reached its end
  • has partially watched a movie recently
  • has often watched the show around the current time of the day or on the current device
On the other hand, a discovery session is more likely if a member:
  • has just finished watching a movie or all episodes of a TV show
  • hasn’t watched anything recently
  • is new to the service
These hypotheses, along with the high fraction of streaming hours spent by members in continuation mode, motivated us to build machine learning models that can identify and harness these patterns to produce a more effective CW row.

Building a Recommendation Model for Continue Watching

To build a recommendation model for the CW row, we first need to compute a collection of features that extract patterns of behavior that could help the model predict when someone will resume a show. These may include features about the member, the shows in the CW row, the member's past interactions with those shows, and some contextual information. We then use these features as inputs to build machine learning models. Through an iterative process of variable selection, model training, and cross-validation, we can refine and select the most relevant set of features.

While brainstorming for features, we considered many ideas for building the CW models, including:

  1. Member-level features:
    • Data about member’s subscription, such as the length of subscription, country of signup, and language preferences
    • How active the member has been recently
    • Member’s past ratings and genre preferences
  2. Features encoding information about a show and interactions of the member with it:
    • How recently was the show added to the catalog, or watched by the member
    • How much of the movie/show the member watched
    • Metadata about the show, such as type, genre, and number of episodes; for example, kids shows may be re-watched more often
    • The rest of the catalog available to the member
    • Popularity and relevance of the show to the member
    • How often members resume this show
  3. Contextual features:
    • Current time of the day and day of the week
    • Location, at various resolutions
    • Devices used by the member
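As a concrete illustration of how such signals might be combined, here is a minimal sketch of assembling a feature vector for one (member, show) candidate in the CW row. The feature names and values are hypothetical, not the actual Netflix feature set.

# Hypothetical feature assembly for a single (member, show) candidate.
from datetime import datetime


def build_features(member, show, context):
    return {
        # member-level
        "days_subscribed": member["days_subscribed"],
        "plays_last_7d": member["plays_last_7d"],
        # show / interaction-level
        "hours_since_last_play": show["hours_since_last_play"],
        "fraction_watched": show["fraction_watched"],
        "is_kids_title": int(show["is_kids_title"]),
        # contextual
        "hour_of_day": context["now"].hour,
        "is_weekend": int(context["now"].weekday() >= 5),
        "device_type": context["device_type"],
    }


example = build_features(
    member={"days_subscribed": 420, "plays_last_7d": 12},
    show={"hours_since_last_play": 6.5, "fraction_watched": 0.45,
          "is_kids_title": False},
    context={"now": datetime(2016, 9, 20, 22, 0), "device_type": "tv"},
)
print(example)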

Two applications, two models


As mentioned above, we have two tasks related to organizing a member's Continue Watching shows: ranking the shows within the CW row and placing the CW row appropriately on the member's homepage.

Show ranking


To rank the shows within the row, we trained a model that optimizes a ranking loss function. To train it, we used sessions from a random set of members where the member resumed a previously-watched show - i.e., continuation sessions. Within each session, the model learns to differentiate among the candidate shows for continuation and ranks them in order of predicted likelihood of play. When building the model, we placed special importance on having it rank the show that was actually resumed at the first position.
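The post does not specify the exact loss function used, but as an illustration, one common choice for this kind of task is a pairwise ranking loss in which the show that was actually resumed should outscore every other candidate in the session. A minimal sketch:

# Pairwise logistic ranking loss for a single continuation session.
# Illustrative only; not necessarily the loss used in production.
import numpy as np


def pairwise_ranking_loss(scores, resumed_index):
    """scores: model scores for all candidate shows in the session.
    resumed_index: index of the show the member actually resumed."""
    scores = np.asarray(scores, dtype=float)
    positive = scores[resumed_index]
    negatives = np.delete(scores, resumed_index)
    # log(1 + exp(-(s_pos - s_neg))) for every (positive, negative) pair
    margins = positive - negatives
    return float(np.mean(np.log1p(np.exp(-margins))))


# Lower loss when the resumed show (index 0) clearly outscores the rest.
print(pairwise_ranking_loss([2.1, 0.3, -0.5, 0.8], resumed_index=0))
print(pairwise_ranking_loss([0.1, 0.3, 1.5, 0.8], resumed_index=0))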

We performed an offline evaluation to understand how well the model ranks the shows in the CW row. Our baseline for comparison was the previous system, where the shows were simply sorted by how recently each show was played. This recency rank is a strong baseline (much better than random) and is also used as a feature in our new model. Comparing the model to recency ranking, we observed a significant lift in various offline metrics. The figure below displays Precision@1 of the two schemes over time. One can see that the lift in performance is much greater than the daily variation.
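For reference, Precision@1 here simply measures how often the top-ranked show in the CW row is the one the member actually resumed. A small sketch of the computation:

# Precision@1 over a set of evaluation sessions (illustrative data).
def precision_at_1(sessions):
    """sessions: list of (ranked_show_ids, resumed_show_id) tuples."""
    hits = sum(1 for ranked, resumed in sessions if ranked[0] == resumed)
    return hits / len(sessions)


sessions = [
    (["narcos", "bojack", "sid"], "narcos"),   # hit
    (["bojack", "narcos", "sid"], "narcos"),   # miss
    (["sid", "narcos", "bojack"], "sid"),      # hit
]
print(precision_at_1(sessions))  # 0.666...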




This model performed significantly better than recency-based ranking in an A/B test and better matched our expectations for member behavior. As an example, we learned that members whose rows were ranked using the new model had fewer plays originating from the search page. This meant that many members had been resorting to searching for a recently-watched show because they could not easily locate it on the home page, a suboptimal experience that the model helped ameliorate.

Row placement


To place the CW row appropriately on a member’s homepage, we would like to estimate the likelihood of the member being in a continuation mode vs. a discovery mode. With that likelihood we could take different approaches. A simple approach would be to turn row placement into a binary decision problem where we consider only two candidate positions for the CW row: one position high on the page and another one lower down. By applying a threshold on the estimated likelihood of continuation, we can decide in which of these two positions to place the CW row. That threshold could be tuned to optimize some accuracy metrics. Another approach is to take the likelihood and then map it onto different positions, possibly based on the content at that location on the page. In any case, getting a good estimate of the continuation likelihood is critical for determining the row placement. In the following, we discuss two potential approaches for estimating the likelihood of the member operating in a continuation mode.
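As a minimal sketch of the simple thresholding approach described above (the threshold and row positions are made-up values that would in practice be tuned offline and via A/B tests):

# Map an estimated continuation likelihood to one of two candidate
# CW row positions. Values are hypothetical.
HIGH_POSITION = 1   # near the top of the homepage
LOW_POSITION = 8    # further down the page


def place_cw_row(p_continuation, threshold=0.5):
    return HIGH_POSITION if p_continuation >= threshold else LOW_POSITION


print(place_cw_row(0.82))  # 1 -> member likely in continuation mode
print(place_cw_row(0.21))  # 8 -> member likely in discovery mode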

Reusing the show-ranking model


A simple approach to estimating the likelihood of continuation vs. discovery is to reuse the scores predicted by the show-ranking model. More specifically, we could calibrate the scores of individual shows in order to estimate the probability P(play(s)=1) that each show s will be resumed in the given session. We can use these individual probabilities over all the shows in the CW row to obtain an overall probability of continuation; i.e., the probability that at least one show from the CW row will be resumed. For example, under a simple assumption of independence of different plays, we can write the probability that at least one show from the CW row will be played as:
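(The formula itself appeared as an image in the original post; under the stated independence assumption it can be reconstructed as the standard expression below, where CW denotes the set of shows in the row.)

P(\text{continuation}) = 1 - \prod_{s \in \text{CW}} \bigl(1 - P(\text{play}(s) = 1)\bigr)

In words: one minus the probability that none of the shows in the CW row is resumed.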

Dedicated row model


In this approach, we train a binary classifier that treats continuation sessions as positive labels and sessions where the member played a show for the first time (discovery sessions) as negative labels. Potential features for this model include member-level and contextual features, as well as the member's interactions with the most recent shows in her viewing history.
The first approach is simpler because it only requires a single model, as long as the probabilities are well calibrated. However, the second approach is likely to provide a more accurate estimate of continuation because we can train a classifier specifically for that task.
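The post does not name the model class used for the dedicated row model; purely as an illustration, a minimal sketch with a logistic regression and made-up feature values might look like this:

# Binary classifier: continuation session (1) vs. discovery session (0).
# Model choice, features, and data are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns (hypothetical): hours_since_last_play, fraction_watched of the
# most recently played show, is_weekend
X_train = np.array([
    [2.0, 0.40, 0],    # stopped mid-show recently   -> continuation
    [5.0, 0.55, 1],
    [90.0, 1.00, 0],   # finished a show days ago    -> discovery
    [200.0, 0.00, 1],  # nothing watched recently    -> discovery
])
y_train = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)
p_continuation = clf.predict_proba([[3.0, 0.30, 0]])[0, 1]
print(round(p_continuation, 3))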

Tuning the placement


In our experiments, we evaluated our estimates of continuation likelihood using classification metrics and achieved good offline results. However, a challenge that still remains is finding an optimal mapping for that estimated likelihood, i.e., balancing continuation and discovery. Varying the placement creates a trade-off between two types of errors in our prediction: false positives (where we incorrectly predict that the member wants to resume a show from the CW row) and false negatives (where we incorrectly predict that the member wants to discover new content). These two types of errors have different impacts on the member. In particular, a false negative makes it harder for members to continue bingeing on a show; while experienced members can find the show by scrolling down the page or by using search, the additional friction can make it more difficult for people new to the service. A false positive, on the other hand, wastes screen real estate that could have been used to display more relevant discovery recommendations. Since the impacts of the two types of errors on the member experience are difficult to measure accurately offline, we A/B tested different placement mappings and learned from online experiments which mapping led to the highest member engagement.

Context Awareness


One of our hypotheses was that continuation behavior depends on context: time, location, device, etc. If that is the case, given proper features, the trained models should be able to detect those patterns and adapt the predicted probability of resuming shows based on the member's current context. For example, members may have habits of watching a certain show around the same time of day (say, comedies at around 10 PM on weekdays). As an example of context awareness, consider how the model uses contextual features to distinguish between the behavior of a member on different devices. If a profile has just watched a few minutes of the show "Sid the Science Kid" on an iPhone and the show "Narcos" on the Netflix website, the CW model immediately ranks "Sid the Science Kid" at the top position of the CW row on the iPhone, and puts "Narcos" in the first position on the website.


Serving the Row

Members expect the CW row to be responsive and to change dynamically after they watch a show. Moreover, some of the features in the model are time- and device-dependent and cannot be precomputed, an approach we rely on for some of our other recommendation systems. Therefore, we need to compute the CW row in real time to make sure it is fresh when we get a request for a homepage at the start of a session. To keep it fresh, we also need to recompute it within a session after certain member interactions and immediately push the result to the client to refresh their homepage. Computing the row on the fly at our scale is challenging and requires careful engineering. For example, some features are more expensive to compute for members with longer viewing histories, yet we need reasonable response times for all members because continuation is a very common scenario. To address these challenges, we collaborated with several engineering teams to create a dynamic and scalable way of serving the row.

Conclusion

Having a better Continue Watching row clearly makes it easier for our members to jump right back into the content they are enjoying, while also getting out of the way when they want to discover something new. While we've taken a few steps towards improving this experience, there are still many areas for improvement. One challenge is that we seek to unify how we place this row with respect to the rest of the rows on the homepage, which are predominantly focused on discovery. This is challenging because different algorithms are designed to optimize for different actions, so we need a way to balance them. We also want to be careful not to push CW too much; we want people to "Binge Responsibly" and also explore new content. There are also details to dig into, such as determining whether a member has actually finished a show so we can remove it from the row; this can be complicated by scenarios such as someone turning off their TV but not the playback device, or falling asleep while watching. Finally, we keep an eye out for new ways to use the CW model in other aspects of the product.
Can’t wait to see how the Netflix Recommendation saga continues? Join us in tackling these kinds of algorithmic challenges and help write the next episode.

Netflix Chaos Monkey Upgraded


We are pleased to announce a significant upgrade to one of our more popular OSS projects. Chaos Monkey 2.0 is now on GitHub!

Years ago, we decided to improve the resiliency of our microservice architecture.  At our scale it is guaranteed that servers on our cloud platform will sometimes suddenly fail or disappear without warning.  If we don’t have proper redundancy and automation, these disappearing servers could cause service problems.

The Freedom and Responsibility culture at Netflix doesn't have a mechanism to force engineers to architect their code in any specific way. Instead, we found that we could build strong alignment around resiliency by taking the pain of disappearing servers and bringing that pain forward. We created Chaos Monkey to randomly choose servers in our production environment and turn them off during business hours. Some people thought this was crazy, but we couldn't rely on infrequent, naturally occurring failures to change engineering behavior. Knowing that terminations would happen on a frequent basis created strong alignment among our engineers to build in the redundancy and automation needed to survive this type of incident without any impact to the millions of Netflix members around the world.

We value Chaos Monkey as a highly effective tool for improving the quality of our service.  Now Chaos Monkey has evolved.  We rewrote the service for improved maintainability and added some great new features.  The evolution of Chaos Monkey is part of our commitment to keep our open source software up to date with our current environment and needs.

Integration with Spinnaker

Chaos Monkey 2.0 is fully integrated with Spinnaker, our continuous delivery platform.
Service owners set their Chaos Monkey configs through the Spinnaker apps, Chaos Monkey gets information about how services are deployed from Spinnaker, and Chaos Monkey terminates instances through Spinnaker.

Since Spinnaker works with multiple cloud backends, Chaos Monkey does as well. In the Netflix environment, Chaos Monkey terminates virtual machine instances running on AWS and Docker containers running on Titus, our container cloud.

Integration with Spinnaker gave us the opportunity to improve the UX as well. We interviewed our internal customers and came up with a more intuitive method of scheduling terminations. Service owners can now express a schedule in terms of the mean time between terminations, rather than a probability over an arbitrary period of time. We also added grouping by app, stack, or cluster, so that applications with different redundancy architectures can schedule Chaos Monkey appropriately for their configuration. Chaos Monkey now also supports specifying exceptions, so users can opt specific clusters out. Some engineers at Netflix use this feature to opt out small clusters that are used for testing.
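To make the mean-time-between-terminations idea concrete, here is a rough sketch, not the actual Chaos Monkey implementation, of how such a setting could be turned into a per-day termination decision; all config values are hypothetical.

# Sketch only: convert a mean-time-between-terminations setting into a
# daily termination decision. Not Chaos Monkey's real code or config.
import random


def should_terminate_today(mean_days_between_terminations):
    # On average, roughly one termination every N days.
    return random.random() < 1.0 / mean_days_between_terminations


config = {
    "app": "example-api",                 # hypothetical app name
    "grouping": "cluster",                # app, stack, or cluster
    "mean_days_between_terminations": 5,
    "exceptions": ["example-api-test"],   # clusters opted out
}

if should_terminate_today(config["mean_days_between_terminations"]):
    print("terminate one random instance per in-scope", config["grouping"])
else:
    print("no terminations scheduled today")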

Chaos Monkey Spinnaker UI
Tracking Terminations

Chaos Monkey can now be configured with trackers. These external services receive a notification when Chaos Monkey terminates an instance. Internally, we use this feature to report metrics into Atlas, our telemetry platform, and Chronos, our event tracking system. The graph below, taken from the Atlas UI, shows the number of Chaos Monkey terminations for a segment of our service. We can see chaos in action. Chaos Monkey even periodically terminates itself.

Chaos Monkey termination metrics in Atlas
Termination Only

Netflix only uses Chaos Monkey to terminate instances. Previous versions of Chaos Monkey allowed the service to SSH into a box and perform other actions such as burning up CPU or taking disks offline. If you currently use one of the prior versions of Chaos Monkey to run an experiment that involves anything other than turning off an instance, you may not want to upgrade, since you would lose that functionality.

Finale

We also used this opportunity to introduce many small features such as automatic opt-out for canaries, cross-account terminations, and automatic disabling during an outage. Find the code on the Netflix GitHub account and embrace the chaos!

-Chaos Engineering Team at Netflix
Lorin Hochstein, Casey Rosenthal

Netflix at RecSys 2016 - Recap


A key aspect of Netflix is providing our members with a personalized experience so they can easily find great stories to enjoy. A collection of recommender systems drive the main aspects of this personalized experience and we continuously work on researching and testing new ways to make them better. As such, we were delighted to sponsor and participate in this year’s ACM Conference on Recommender Systems in Boston, which marked the 10th anniversary of the conference. For those who couldn’t attend or want more information, here is a recap of our talks and papers at the conference.

Justin and Yves gave a talk titled "Recommending for the World" on how we prepared our algorithms to work worldwide ahead of our global launch earlier this year. You can also read more about it in our previous blog posts.



Justin also teamed up with Xavier Amatriain, formerly at Netflix and now at Quora, in the special Past, Present, and Future track to offer an industry perspective on the future of recommender systems.



Chao-Yuan Wu presented a paper he authored last year while at Netflix, on how to use navigation information to adapt recommendations within a session as you learn more about user intent.



Yves also shared some pitfalls of distributed learning at the Large Scale Recommender Systems workshop.



Hossein Taghavi gave a presentation at the RecSysTV workshop on trying to balance discovery and continuation in recommendations, which is also the subject of a recent blog post.



Dawen Liang presented some research he conducted prior to joining Netflix on combining matrix factorization and item embedding.



If you are interested in pushing the frontier forward in the recommender systems space, take a look at some of our relevant open positions!