WebKit in Your Living Room


Hi, it's Matt Seeley, engineer on the device UI team at Netflix.  My team uses WebKit, JavaScript, HTML5 and CSS3 to build user interfaces for the PlayStation 3, Wii, Blu-ray players, Internet-connected TVs, phones and tablets.

Recently I spoke at the HTML5 Dev Conf about WebKit-based UI development on consumer electronics.  I discussed:
  • Responding to user input quickly while deferring expensive user interface updates
  • Managing main and video memory footprint of a single page application
  • Understanding WebKit's interpretation and response to changes in HTML and CSS 
  • Achieving highest possible animation frame rates using accelerated compositing
Watch the video presentation from the conference:


Slides are also available in PDF.

Astute readers will realize that portions of the content are also suitable for mobile and desktop development. It's all about building great user interfaces that take best possible advantage of the device.

Interested in joining our team? We're hiring!






Ephemeral Volatile Caching in the cloud

by Shashi Madappa


In most applications there is some amount of data that will be frequently used. Some of this data is transient and can be recalculated, while other data will need to be fetched from the database or a middle tier service. In the Netflix cloud architecture we use caching extensively to offset some of these operations.  This document details Netflix’s implementation of a highly scalable memcache-based caching solution, internally referred to as EVCache.

Why do we need Caching?

Some of the objectives of the Cloud initiative were:
    • Faster response times compared to the Netflix data center based solution
    • Moving from session-based applications in the data center to stateless services without sessions in the cloud
    • Using NoSQL based persistence like Cassandra/SimpleDB/S3

To meet these objectives we needed the ability to store data in a cache that was fast, shared and scalable. We use the cache to front data that is computed or retrieved from a persistence store like Cassandra, or from other AWS services like S3 and SimpleDB; those calls can take several hundred milliseconds at the 99th percentile, causing a widely variable user experience. By fronting this data with a cache, access times are much faster and more predictable, and the load on these datastores is greatly reduced. Caching also enables us to respond to sudden request spikes more effectively. Additionally, an overloaded service can often return a prior cached response; this ensures that the user gets a personalized response instead of a generic one. By using caching effectively we have reduced the total cost of operation.

What is EVCache?

EVCache is a memcached & spymemcached based caching solution that is well integrated with Netflix and AWS EC2 infrastructure.

EVCache is an abbreviation for:

Ephemeral - The data stored is for a short duration as specified by its TTL [1] (Time To Live).
Volatile - The data can disappear any time (Evicted [2]).
Cache - An in-memory key-value store.


How is it used?

We have over 25 different use cases of EVCache within Netflix. One particular use case is a user's home page. For example, to decide which rows to show to a particular user, the algorithm needs to know the user's taste, movie viewing history, queue, ratings, etc. This data is fetched from various services in parallel and is fronted using EVCache by these services.

Features

We will now detail the features, including both the add-ons by Netflix and those that come with memcached.

  • Overview
    • Distributed Key-Value store,  i.e., the cache is spread across multiple instances
    • AWS Zone-Aware and data can be replicated across zones.  
    • Registers and works with Netflix’s internal Naming Service for automatic discovery of new nodes/services.
    • To store data, the key has to be a non-null String and the value can be a non-null byte array, primitive, or serializable object. Values should be less than 1 MB.
    • As a generic cache cluster that can be used across various applications, it supports an optional Cache Name, to be used as namespace to avoid key collisions.
    • Typical cache hit rates are above 99%.
    • Works well with the Netflix Persister Framework [7]. For example: in-memory, backed by EVCache, backed by Cassandra/SimpleDB/S3.
  • Elasticity and deployment ease: EVCache is linearly scalable. We monitor capacity and can add capacity within a minute, and potentially re-balance and warm data in the new node within a few minutes. Note that we have good capacity modeling in place, so capacity changes are not something we do very frequently, but we have good ways of adding capacity while actively managing the cache hit rate. Stay tuned for more on this scalable cache warmer in an upcoming blog post.
  • Latency: Typical response time in low milliseconds.  Reads from EVCache are typically served back from within the same AWS zone.  A nice side effect of zone affinity is that we don’t have any data transfer fees for reads.
  • Inconsistency: This is a best-effort cache and the data can get inconsistent. The architecture we have chosen favors speed over consistency, and the applications that depend on EVCache are capable of handling any inconsistency. For data that is stored for a short duration, the TTL ensures that inconsistent data expires; for data that is stored for a longer duration we have built consistency checkers that repair it.
  • Availability: Typically, the cluster never goes down, as its instances are spread across multiple Amazon Availability Zones. When instances do go down occasionally, cache misses are minimal as we use consistent hashing to shard the data across the cluster.
  • Total Cost of Operations: Beyond the very low cost of operating the EVCache cluster, one has to be aware that cache misses are generally much costlier: the cost of accessing services such as AWS SimpleDB, AWS S3, and (to a lesser degree) Cassandra on EC2 must be factored in as well. We are happy with the overall cost of operations of EVCache clusters, which are highly stable and linearly scalable.

Under the Hood

Server: The server consists of the following:
  • memcached server.
  • Java Sidecar - A Java app that communicates with the Discovery Service [6] (Name Server) and hosts admin pages.
  • Various apps that monitor the health of the services and report stats.



Client:  A Java client discovers EVCache servers and manages all the CRUD [3] (Create, Read, Update & Delete) operations. The client automatically handles the case when servers are added to or removed from the cluster. The client replicates data (based on AWS Zone [5]) during Create, Update & Delete operations; for Read operations, the client gets the data from a server in the same zone as the client.

We will be open sourcing this Java client sometime later this year so we can share more of our learnings with the developer community.
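To make that concrete, the sketch below shows what using such a client could look like. It is purely illustrative: the builder and method names are assumptions, not the released API.

// Purely illustrative sketch of an EVCache-style client; the builder and
// method names here are hypothetical, not the released API.
EVCache cache = new EVCache.Builder()
        .setAppName("EVCACHE")          // cache cluster to talk to
        .setCachePrefix("cid")          // optional namespace to avoid key collisions
        .setDefaultTTL(900)             // seconds; data is ephemeral
        .build();

// Writes are replicated to every zone; reads are served from the local zone.
cache.set("movie:12345:similars", serializedSimilars);
byte[] similars = cache.get("movie:12345:similars");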

Single Zone Deployment

The figure below illustrates the scenario in the AWS US-EAST Region [4], Zone-A, where an EVCache cluster with 3 instances has a Web Application performing CRUD operations on the EVCache system.

  1. Upon startup, an EVCache Server instance registers with the Naming Service [6] (Netflix's internal name service that contains all the hosts that we run).
  2. During startup of the Web App, the EVCache Client library is initialized; it looks up all the EVCache server instances registered with the Naming Service and establishes connections with them.
  3. When the Web App needs to perform a CRUD operation for a key, the EVCache client selects the instance on which these operations can be performed. We use consistent hashing to shard the data across the cluster, as sketched below.
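The consistent-hashing step can be pictured with the simplified sketch below. It is illustrative only, not the actual client code: servers are placed on a hash ring and a key is routed to the first server at or after the key's hash.

import java.util.SortedMap;
import java.util.TreeMap;

// Simplified consistent-hashing ring: each server sits on the ring at hash(serverName);
// a key is stored on the first server clockwise from hash(key).
public class SimpleHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();

    public void addServer(String server) {
        ring.put(hash(server), server);
    }

    public String serverFor(String key) {
        int h = hash(key);
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String s) {
        // Real implementations use a stronger hash (e.g. Ketama/MD5) and virtual nodes.
        return s.hashCode() & 0x7fffffff;
    }
}

Because each key maps to a fixed point on the ring, adding or removing a single server only moves the keys adjacent to it, which is why cache misses stay small when instances come and go.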






Multi-Zone Deployment

The figure below illustrates the scenario where we have replication across multiple zones in AWS US-EAST Region. It has an EVCache cluster with 3 instances and a Web App in Zone-A and Zone-B.  

  1. Upon startup, an EVCache Server instance in Zone-A registers with the Naming Service in Zone-A and Zone-B.
  2. During the startup of the Web App in Zone-A, the Web App initializes the EVCache Client library, which looks up all the EVCache server instances registered with the Naming Service and connects to them across all zones.
  3. When the Web App in Zone-A needs to read the data for a key, the EVCache client looks up the EVCache Server instance in Zone-A which stores this data and fetches the data from that instance.
  4. When the Web App in Zone-A needs to write or delete the data for a key, the EVCache client looks up the EVCache Server instances in Zone-A and Zone-B and writes or deletes the data in both zones (see the sketch after this list).
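The zone-aware behavior above can be summarized with this simplified sketch. It is illustrative only; the class and interface names are assumptions: writes fan out to every zone, reads stay in the client's zone.

import java.util.Map;

// Illustrative sketch of zone-aware replication; not the actual EVCache client.
class ZoneAwareClient {
    private final String localZone;                       // e.g. "us-east-1a"
    private final Map<String, MemcachedZoneClient> zones; // one client per zone

    ZoneAwareClient(String localZone, Map<String, MemcachedZoneClient> zones) {
        this.localZone = localZone;
        this.zones = zones;
    }

    void set(String key, byte[] value, int ttlSeconds) {
        // Create/Update/Delete operations are replicated to every zone.
        for (MemcachedZoneClient zone : zones.values()) {
            zone.set(key, value, ttlSeconds);
        }
    }

    byte[] get(String key) {
        // Reads are served from the client's own zone to avoid cross-zone traffic.
        return zones.get(localZone).get(key);
    }
}

interface MemcachedZoneClient {
    void set(String key, byte[] value, int ttlSeconds);
    byte[] get(String key);
}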







Case Study : Movie and TV show similarity

One of the applications that uses caching heavily is the Similars application. This application suggests movies and TV shows that are similar to each other. Once the similarities are calculated they are persisted in SimpleDB/S3 and are fronted using EVCache. When any service, application or algorithm needs this data, it is retrieved from EVCache and the result is returned.
  1. A client sends a request to the WebApp requesting a page, and the algorithm that is processing this request needs similars for a movie to compute this data.
  2. The WebApp that needs a list of similars for a movie or TV show looks up EVCache for this data. The typical cache hit rate is above 99.9%.
  3. If there is a cache miss, the WebApp calls the Similars App to compute this data.
  4. If the data was previously computed but is missing in the cache, the Similars App reads it from SimpleDB. If it is missing in SimpleDB, the app calculates the similars for the given movie or TV show.
  5. This computed data for the movie or TV show is then written to EVCache.
  6. The Similars App then computes the response needed by the client and returns it to the client.





Metrics, Monitoring, and Administration

Administration of the various clusters is centralized, and all the administration and monitoring of the cluster and instances can be performed via the web interface illustrated below.
















The server view below shows the details of each instance in the cluster and also rolls up the stats by zone. Using this tool, the contents of a memcached slab can also be viewed.

The EVCache clusters currently serve over 200K requests/sec at peak loads. The chart below shows the number of requests to EVCache every hour.






The average latency is around 1 to 5 milliseconds; the 99th percentile is around 20 milliseconds.
Typical cache hit rates are above 99%.


Join Us

Like what you see and want to work on bleeding edge performance and scale?  
We’re hiring !

References

  1. TTL: Time To Live for data stored in the cache. After this time the data expires and will not be returned when requested.
  2. Evicted: Data associated with a key can be evicted (removed) from the cache even though its TTL has not yet been exceeded. This happens when the cache is running low on memory and needs to make space for new data being stored. Eviction is based on LRU (Least Recently Used).
  3. CRUD: Create, Read, Update and Delete are the basic functions of storage.
  4. AWS Region: A geographical region; currently US East (Virginia), US West, EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo) and South America (Sao Paulo).
  5. AWS Zone: Each availability zone runs on its own physically distinct and independent infrastructure. You can think of a zone as a data center.
  6. Naming Service: A service developed by Netflix that is a registry for all the instances that run Netflix services.
  7. Netflix Persister Framework: A framework developed by Netflix that helps users persist data across various datastores (In-Memory/EVCache/Cassandra/SimpleDB/S3) by providing a simple API.

by Shashi Madappa, Senior Software Engineer, Personalization Infrastructure Team


Announcing Astyanax

By Eran Landau 

Over the past year we have been investing heavily in Cassandra as our primary persistent storage solution. We currently run 55 separate clusters, ranging from 6 to 48 nodes. We've been active contributors to Cassandra and have developed additional tools and our own client library. Today we are open sourcing that client, Astyanax, as part of our ongoing open source initiative. Astyanax started as a re-factor of Hector, but our experience running a large number of diverse clusters has enabled us to tune the client for various scenarios and focus on a wide range of client use cases.

What is Astyanax?

Astyanax is a Java Cassandra client. It borrows many concepts from Hector but diverges in the connection pool implementation as well as the client API. One of the main design considerations was to provide a clean abstraction between the connection pool and Cassandra API so that each may be customized and improved separately. Astyanax provides a fluent style API which guides the caller to narrow the query from key to column as well as providing queries for more complex use cases that we have encountered. The operational benefits of Astyanax over Hector include lower latency, reduced latency variance, and better error handling.
Astyanax is broken up into three separate parts:
  • Connection pool: The connection pool abstraction and several implementations including round robin, token aware and bag of connections.
  • Cassandra-thrift API implementation: Cassandra Keyspace and Cluster level APIs implementing the thrift interface.
  • Recipes and utilities: Utility classes built on top of the astyanax-cassandra-thrift API.

Astyanax API

Astyanax implements a fluent API which guides the caller to narrow or customize the query via a set of well defined interfaces. We've also included some recipes that will be executed efficiently and as close to the low level RPC layer as possible. The client also makes heavy use of generics and overloading to almost eliminate the need to specify serializers.
Some key features of the API include:
  • Key and column types are defined in a ColumnFamily class which eliminates the need to specify serializers.
  • Multiple column family key types in the same keyspace.
  • Annotation based composite column names.
  • Automatic pagination.
  • Parallelized queries that are token aware.
  • Configurable consistency level per operation.
  • Configurable retry policy per operation.
  • Pin operations to specific node.
  • Async operations with a single timeout using Futures.
  • Simple annotation based object mapping.
  • Operation result returns host, latency, attempt count.
  • Tracer interfaces to log custom events for operation failure and success.
  • Optimized batch mutation.
  • Completely hide the clock for the caller, but provide hooks to customize it.
  • Simple CQL support.
  • RangeBuilders to simplify constructing simple as well as composite column ranges.
  • Composite builders to simplify creating composite column names.

Recipes

Recipes for some common use cases:
  • CSV importer.
  • JSON exporter to convert any query result to JSON with a wide range of customizations.
  • Parallel reverse index search.
  • Key unique constraint validation.

Connection pool

The Astyanax connection pool was designed to provide a complete abstraction from the client API layer. One of our main goals when preparing Astyanax to be open sourced was to properly decouple components of the connection pool so that others may easily plug in their customizations. For example, we have our own middle tier load balancer that keeps track of nodes in the cluster and have made it the source of seed nodes to the client.
Key features of the connection pool are:
  • HostSupplier/NodeAutoDiscovery: Background task that frequently refreshes the client host list. There is also an implementation that consolidates a describe ring against the local region's list of hosts to prevent cross-regional client traffic.
  • TokenAware: The token aware connection pool implementation keeps track of which hosts own which tokens and intelligently directs traffic to a specific range with fallback to round robin.
  • RoundRobin: A standard round robin implementation.
  • Bag: We found that for our shared cluster we needed to limit the number of client connections. This connection pool opens a limited number of connections to random hosts in a ring.
  • ExecuteWithFailover: This abstraction lets the connection pool implementation capture a fail-over context efficiently.
  • RetryPolicy: Retry on top of the normal fail-over in ExecuteWithFailover. Fail-over addresses problems such as connections not being available on a host, a host going down in the middle of an operation, or timeouts. Retry implements backoff and retrying the entire operation with an entirely new context.
  • BadHostDetector: Determines when a node has gone down, based on timeouts.
  • LatencyScoreStrategy: Algorithm to determine a node's score based on latency. Two modes: unordered (round robin) and ordered (best score gets priority).
  • ConnectionPoolMonitor: Monitors all events in the connection pool and can tie into proprietary monitoring mechanisms. We found that logging deep in the connection pool code tended to be very verbose, so we funnel all events to a monitoring interface so that logging and alerting may be controlled externally.
  • Pluggable real-time configuration: The connection pool configuration is kept by a single object referenced throughout the code. For our internal implementation this configuration object is tied to volatile properties that may change at run time and are picked up by the connection pool immediately, thereby allowing us to tweak client performance at runtime without having to restart.
  • RateLimiter: Limits the number of connections that can be opened within a given time window. We found this necessary for certain types of network outages that cause a thundering herd of connection attempts, overwhelming Cassandra.

A taste of Astyanax

Here's a brief code snippet to give you a taste of what the API looks like.
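The sketch below is reconstructed from Astyanax's public getting-started examples; exact method names (for example getEntity vs. getClient) vary by version, and error handling is kept minimal.

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxTaste {
    // Key and column types are defined once on the ColumnFamily, so no serializers
    // need to be specified on individual queries.
    private static final ColumnFamily<String, String> CF_USER_INFO =
            new ColumnFamily<String, String>("UserInfo", StringSerializer.get(), StringSerializer.get());

    public static void main(String[] args) throws ConnectionException {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("TestCluster")
                .forKeyspace("KeyspaceName")
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
                        .setPort(9160)
                        .setMaxConnsPerHost(10)
                        .setSeeds("127.0.0.1:9160"))
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getEntity();   // getClient() in later versions

        // Write a row with two columns using a batch mutation.
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF_USER_INFO, "acct1234")
                .putColumn("firstname", "john", null)
                .putColumn("lastname", "smith", null);
        m.execute();

        // Read the row back, narrowing the fluent query from key to column.
        ColumnList<String> columns = keyspace.prepareQuery(CF_USER_INFO)
                .getKey("acct1234")
                .execute()
                .getResult();
        System.out.println(columns.getColumnByName("firstname").getStringValue());
    }
}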


Accessing Astyanax

Announcing Servo

By Brian Harrington & Greg Orzell

In a previous blog post about auto scaling, I mentioned that we would be open sourcing the library that we use to expose application metrics. Servo is that library. It is designed to make it easy for developers to export metrics from their application code, register them with JMX, and publish them to external monitoring systems such as Amazon's CloudWatch. This is especially important at Netflix because we are a data driven company and it is essential that we know what is going on inside our applications in near real time. As we increased our use of auto scaling based on application load, it became important for us to be able to publish custom metrics to CloudWatch so that we could configure auto-scaling policies based on the metrics that most accurately capture the load for a given application. We already had the servo framework in place to publish data to our internal monitoring system, so it was extended to allow for exporting a subset (AWS charges on a per metric basis) of the data into CloudWatch.

Features

  • Simple: It is trivial to expose and publish metrics without having to write lots of code such as MBean interfaces.
  • JMX Registration: JMX is the standard monitoring interface for Java and can be queried by many existing tools. Servo makes it easy to expose metrics to JMX so they can be viewed from a wide variety of Java tools such as VisualVM.
  • Flexible publishing: Once metrics are exposed, it should be easy to regularly poll the metrics and make them available for internal reporting systems, logs, and services like Amazon's CloudWatch. There is also support for filtering to reduce cost for systems that charge per metric, and asynchronous publishing to help isolate the collection from downstream systems that can have unpredictable latency.

The rest of this post provides a quick preview of Servo, for a more detailed overview see the Servo wiki.

Registering Metrics

Registering metrics is designed to be both easy and flexible. Using annotations you can call out the fields or methods that should be monitored for a class and specify both static and dynamic metadata. The example below shows a basic server class with some stats about the number of connections and the amount of data that has been seen.
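Below is a minimal sketch of such a class, based on Servo's @Monitor annotation; the field and metric names are illustrative.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

import com.netflix.servo.annotations.DataSourceType;
import com.netflix.servo.annotations.Monitor;

// Basic server class exposing a few stats via Servo annotations.
public class BasicServer {
    @Monitor(name = "CurrentConnections", type = DataSourceType.GAUGE)
    private final AtomicInteger currentConnections = new AtomicInteger(0);

    @Monitor(name = "TotalConnections", type = DataSourceType.COUNTER)
    private final AtomicLong totalConnections = new AtomicLong(0L);

    @Monitor(name = "BytesIn", type = DataSourceType.COUNTER)
    private final AtomicLong bytesIn = new AtomicLong(0L);

    @Monitor(name = "BytesOut", type = DataSourceType.COUNTER)
    private final AtomicLong bytesOut = new AtomicLong(0L);

    public void onConnectionOpened() {
        currentConnections.incrementAndGet();
        totalConnections.incrementAndGet();
    }

    public void onConnectionClosed() {
        currentConnections.decrementAndGet();
    }

    public void onDataReceived(int bytes) { bytesIn.addAndGet(bytes); }
    public void onDataSent(int bytes)     { bytesOut.addAndGet(bytes); }
}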





See the annotations wiki page for a more detailed summary of the available annotations and the options that are available. Once you have annotated your class, you will need to register each new object instance with the registry in order for the metrics to get exposed. A default registry is provided that exports metrics to JMX.
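A minimal sketch of that registration, assuming the BasicServer class sketched above:

import com.netflix.servo.monitor.Monitors;

BasicServer server = new BasicServer();
// Registers the annotated fields with the default registry, which exports them to JMX.
Monitors.registerObject(server);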



Now that the instance is registered, metrics should be visible in tools like VisualVM when you run your application.

Publishing Metrics

After getting into JMX, the next step is to collect the data and make it available to other systems. The servo library provides three main interfaces for collecting and publishing data:

  • MetricObserver: an observer is a class that accepts updates to the metric values. Implementations are provided for keeping samples in memory, exporting to files, and exporting to CloudWatch.
  • MetricPoller: a poller provides a way to collect metrics from a given source. Implementations are provided for querying metrics associated with a monitor registry and arbitrary metrics exposed to JMX.
  • MetricFilter: filters are used to restrict the set of metrics that are polled. The filter is passed in to the poll method call so that metrics that can be expensive to collect will be ignored as soon as possible. Implementations are provided for filtering based on a regular expression and prefixes such as package names.

The example below shows how to configure the collection of metrics each minute to store them on the local file system.
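A minimal sketch of that setup, using Servo's publish classes (class and constructor details may vary by version):

import java.io.File;
import java.util.concurrent.TimeUnit;

import com.netflix.servo.publish.BasicMetricFilter;
import com.netflix.servo.publish.FileMetricObserver;
import com.netflix.servo.publish.MetricObserver;
import com.netflix.servo.publish.MonitorRegistryMetricPoller;
import com.netflix.servo.publish.PollRunnable;
import com.netflix.servo.publish.PollScheduler;

// Poll the default monitor registry once a minute and write the samples to files.
MetricObserver observer = new FileMetricObserver("server-stats", new File("/tmp/metrics"));
PollRunnable task = new PollRunnable(
        new MonitorRegistryMetricPoller(),   // source: everything registered with the registry
        BasicMetricFilter.MATCH_ALL,         // no filtering
        observer);

PollScheduler scheduler = PollScheduler.getInstance();
scheduler.start();
scheduler.addPoller(task, 1, TimeUnit.MINUTES);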



By simply using a different observer, we can instead export the metrics to a monitoring system like CloudWatch.
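For example (a sketch; the CloudWatch observer's constructor arguments are an assumption and may differ by version):

import com.amazonaws.auth.BasicAWSCredentials;
import com.netflix.servo.publish.MetricObserver;
import com.netflix.servo.publish.cloudwatch.CloudWatchMetricObserver;

// Publish the same polled metrics to CloudWatch under a custom namespace.
MetricObserver observer = new CloudWatchMetricObserver(
        "servo-example",                                     // observer name
        "NFLX/ServoExample",                                 // CloudWatch namespace
        new BasicAWSCredentials("accessKey", "secretKey"));  // your AWS credentials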



You have to provide your AWS credentials and namespace at initialization. Servo also provides some helpers for tagging the metrics with common dimensions such as the auto scaling group and instance id. CloudWatch data can be retrieved using the standard Amazon tools and APIs.


Related Links

Servo Project
Servo Documentation
Netflix Open Source

Announcing Priam

By Praveen Sadhu, Vijay Parthasarathy & Aditya Jami

We talked in the past about our move to NoSQL, and Cassandra has been a big part of that strategy. Cassandra hit a big milestone recently with the announcement of the v1 release. We recently announced Astyanax, Netflix's Java Cassandra client with an improved API and connection management, which we open sourced last month.

Today, we're excited to announce another milestone on our open source journey with an addition to make operations and management of Cassandra easier and more automated.

As we embarked on making Cassandra one of our NoSQL databases in the cloud, we needed tools for managing configuration, providing reliable and automated backup/recovery, and automating token assignment within and across regions. Priam was built to meet these needs. The name 'Priam' refers to the king of Troy, in Greek mythology, who was the father of Cassandra.

What is Priam?

Priam is a co-process that runs alongside Cassandra on every node to provide the following functionality:
  • Backup and recovery
  • Bootstrapping and automated token assignment.
  • Centralized configuration management
  • RESTful monitoring and metrics
We are currently using Priam to manage several dozen Cassandra clusters and counting.

Backup and recovery

A dependable backup and recovery process is critical when choosing to run a database in the cloud. With Priam, a daily snapshot and incremental data for all our clusters are backed up to S3. S3 was an obvious choice for backup data due to its simple interface and the ability to access any amount of data from anywhere[1].

Snapshot backup

Priam leverages Cassandra's snapshot feature to produce an eventually consistent backup[2]. Cassandra flushes data to disk and hard-links all SSTable files (data files) into a snapshot directory. SSTables are immutable files and can be safely copied to an external source. Priam picks up these hard-linked files and uploads them to S3. Snapshots are run on a daily basis for the entire cluster, ideally during non-peak hours. Although a snapshot across the cluster is not guaranteed to produce a consistent backup of the cluster, consistency is recovered upon restore by Cassandra and by running repairs. Snapshots can also be triggered on demand via Priam's REST API during upgrades and maintenance operations.

During the backup process, Priam throttles the data read from disk to avoid contention and interference with Cassandra's disk IO as well as network traffic. Schema files are also backed up in the process.

Incremental backup

When incremental backups are enabled in Cassandra, hard-links are created for all new SSTables in the incremental backup directory. Priam scans this directory frequently for incremental SSTable files and uploads them to S3. Incremental data along with the snapshot data are required for a complete backup.

Compression and multipart uploading

Priam uses snappy compression to compress SSTables on the fly. With S3's multi-part upload feature, files are chunked, compressed and uploaded in parallel. Uploads also ensure the file cache is unaffected (set via: fadvise). Priam reliably handles file sizes on the order of several hundred GB in our production environment for several of our clusters.

Restoring data

Priam supports restoring a partial or complete ring. Although the latter is less likely in production, restoring to a full test cluster is a common use case. When restoring data from backup, the Priam process (on each node) locates snapshot files for some or all keyspaces and orchestrates the download of the snapshot, incremental backup files, and starting of the cluster. During this process, Priam strips the ring information from the backup, allowing us to restore to a cluster of half the original size (i.e., by skipping alternate nodes and running repair to regain skipped data). Restoring to a different sized cluster is possible only for the keyspaces with replication factor more than one. Priam can also restore data to clusters with different names allowing us to spin up multiple test clusters with the same data.

Restoring prod data for testing

Using production data in a test environment allows you to test on massive volumes of real data to produce realistic benchmarks. One of the goals of Priam was to automate restoration of data into a test cluster. In fact, at Netflix, we bring up test clusters on-demand by pointing them to a snapshot. This also provides a mechanism for validating production data and offline analysis. SSTables are also used directly by our ETL process.


Figure 1: Listing of backup files in S3

Token Assignment

Priam automates the assignment of tokens to Cassandra nodes as they are added, removed or replaced in the ring. Priam relies on centralized external storage (SimpleDB/Cassandra) for storing token and membership information, which is used to bootstrap nodes into the cluster. It allows us to automate replacing nodes without any manual intervention, since we assume failure of nodes, and create failures using Chaos Monkey. The external Priam storage also provides us valuable information for the backup and recovery process.

To survive failures in the AWS environment and provide high availability, we spread our Cassandra nodes across multiple availability zones within regions. Priam's token assignment feature uses the locality information to allocate tokens across zones in an interlaced manner.

One of the challenges with cloud environments is replacing ephemeral nodes which can get terminated without warning. With token information stored in an external datastore (SimpleDB), Priam automates the replacement of dead or misbehaving (due to hardware issues) nodes without requiring any manual intervention.

Priam also lets us add capacity to existing clusters by doubling them. Clusters are doubled by interleaving new tokens between the existing ones. Priam's ring doubling feature does strategic replica placement to make sure that the original principle of having one replica per zone is still valid (when using a replication factor of at least 3).

We are also working closely with the Cassandra community to automate and enhance the token assignment mechanisms for Cassandra.


Address         DC          Rack        Status State   Load            Owns    Token                   
167778111467962714624302757832749846470
10.XX.XXX.XX us-east 1a Up Normal 628.07 GB 1.39% 1808575600
10.XX.XXX.XX us-east 1d Up Normal 491.85 GB 1.39% 2363071992506517107384545886751410400
10.XX.XXX.XX us-east 1c Up Normal 519.49 GB 1.39% 4726143985013034214769091771694245202
10.XX.XXX.XX us-east 1a Up Normal 507.48 GB 1.39% 7089215977519551322153637656637080002
10.XX.XXX.XX us-east 1d Up Normal 503.12 GB 1.39% 9452287970026068429538183541579914805
10.XX.XXX.XX us-east 1c Up Normal 508.85 GB 1.39% 11815359962532585536922729426522749604
10.XX.XXX.XX us-east 1a Up Normal 497.69 GB 1.39% 14178431955039102644307275311465584408
10.XX.XXX.XX us-east 1d Up Normal 495.2 GB 1.39% 16541503947545619751691821196408419206
10.XX.XXX.XX us-east 1c Up Normal 503.94 GB 1.39% 18904575940052136859076367081351254011
10.XX.XXX.XX us-east 1a Up Normal 624.87 GB 1.39% 21267647932558653966460912966294088808
10.XX.XXX.XX us-east 1d Up Normal 498.78 GB 1.39% 23630719925065171073845458851236923614
10.XX.XXX.XX us-east 1c Up Normal 506.46 GB 1.39% 25993791917571688181230004736179758410
10.XX.XXX.XX us-east 1a Up Normal 501.05 GB 1.39% 28356863910078205288614550621122593217
10.XX.XXX.XX us-east 1d Up Normal 814.26 GB 1.39% 30719935902584722395999096506065428012
10.XX.XXX.XX us-east 1c Up Normal 504.83 GB 1.39% 33083007895091239503383642391008262820
Figure 2: Sample cluster created by Priam


Multi-regional clusters

For multi-regional clusters, Priam allocates tokens by interlacing them between regions. Apart from allocating tokens, Priam provides a seed list across regions for Cassandra and automates security group updates for secure cross-regional communications between nodes. We use Cassandra's inter-DC encryption mechanism to encrypt the data between the regions via public Internet. As we span across multiple AWS regions, we rely heavily on these features to bring up new Cassandra clusters within minutes.

In order to have a balanced multi-regional cluster, we place one replica in each zone, and across all regions. Priam does this by calculating tokens for each region and padding them with a constant value. For example, US-East will start with token 0 whereas EU-West will start with token 0 + x. This allows us to have different sized clusters in each region depending on the usage.
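A rough sketch of that token arithmetic (illustrative only; Priam's actual implementation differs in detail): each node's token is its slot times the ring size divided by the region's node count, plus a small per-region offset.

import java.math.BigInteger;

// Illustrative token calculation for a RandomPartitioner ring (0 .. 2^127 - 1).
public class TokenCalculator {
    private static final BigInteger RING_SIZE = BigInteger.valueOf(2).pow(127);

    // slot: the node's position within its region; regionOffset: a small constant per region.
    public static BigInteger token(int nodesInRegion, int slot, int regionOffset) {
        return RING_SIZE
                .divide(BigInteger.valueOf(nodesInRegion))
                .multiply(BigInteger.valueOf(slot))
                .add(BigInteger.valueOf(regionOffset));
    }
}

Because the offset is constant per region, the tokens of different regions interleave without colliding, which is what the sample ring outputs above and below illustrate.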

When a mutation is performed on a cluster, Cassandra writes to the local nodes and forwards the write asynchronously to the other regions. By placing replicas in a particular order we can withstand zone failures or region failures without most of our services knowing about them. In the diagram below, A1, A2 and A3 are AWS availability zones in one region and B1, B2 and B3 are AWS availability zones in another region.

Note: Replication Factor and the number of nodes in a zone are independent settings for each region.



Figure 3: Write to Cassandra spanning region A and region B
Address         DC          Rack        Status State   Load            Owns    Token                   
170141183460469231731687303715884105727
176.XX.XXX.XX eu-west 1a Up Normal 35.04 GB 0.00% 372748112
184.XX.XXX.XX us-east 1a Up Normal 56.33 GB 8.33% 14178431955039102644307275309657008810
46.XX.XXX.XX eu-west 1b Up Normal 36.64 GB 0.00% 14178431955039102644307275310029756921
174.XX.XXX.XX us-east 1d Up Normal 34.63 GB 8.33% 28356863910078205288614550619314017620
46.XX.XXX.XX eu-west 1c Up Normal 51.82 GB 0.00% 28356863910078205288614550619686765731
50.XX.XXX.XX us-east 1c Up Normal 34.26 GB 8.33% 42535295865117307932921825928971026430
46.XX.XXX.XX eu-west 1a Up Normal 34.39 GB 0.00% 42535295865117307932921825929343774541
184.XX.XXX.XX us-east 1a Up Normal 56.02 GB 8.33% 56713727820156410577229101238628035240
46.XX.XXX.XX eu-west 1b Up Normal 44.55 GB 0.00% 56713727820156410577229101239000783351
107.XX.XXX.XX us-east 1d Up Normal 36.34 GB 8.33% 70892159775195513221536376548285044050
46.XX.XXX.XX eu-west 1c Up Normal 50.1 GB 0.00% 70892159775195513221536376548657792161
50.XX.XXX.XX us-east 1c Up Normal 39.44 GB 8.33% 85070591730234615865843651857942052858
46.XX.XXX.XX eu-west 1a Up Normal 40.86 GB 0.00% 85070591730234615865843651858314800971
174.XX.XXX.XX us-east 1a Up Normal 43.75 GB 8.33% 99249023685273718510150927167599061670
79.XX.XXX.XX eu-west 1b Up Normal 42.43 GB 0.00% 99249023685273718510150927167971809781
Figure 4: Multi-regional Cassandra cluster created by Priam

All our clusters are centrally configured via properties stored in SimpleDB, which includes setup of critical JVM settings and Cassandra YAML properties.

Priam's REST API

One of the goals of Priam was to support managing multiple Cassandra clusters. To achieve that, Priam's REST APIs provide hooks that support external monitoring and automation scripts. They provide the ability to back up or restore a set of nodes manually and provide insights into Cassandra's ring information. They also expose key Cassandra JMX commands such as repair and refresh.

For comprehensive listing of the APIs, please visit the github wiki here.

Key Cassandra facts at Netflix:
  • 57 Cassandra clusters running on hundreds of instances are currently in production, many of which are multi-regional
  • Priam backs up tens of TBs of data to S3 per day.
  • Several TBs of production data is restored into our test environment every week.
  • Nodes get replaced almost daily without any manual intervention
  • All of our clusters use random partitioner and are well-balanced
  • Priam was used to create the 288 node Cassandra benchmark cluster discussed in our earlier blog post[3].
Related Links:

Aegisthus - A Bulk Data Pipeline out of Cassandra


By Charles Smith and Jeff Magnusson

Our job in Data Science and Engineering is to consume all the data that Netflix produces and provide an offline batch-processing platform for analyzing and enriching it. As has been mentioned in previous posts, Netflix has recently been engaged in making the transition to serving a significant amount of data from Cassandra. As with any new data storage technology that is not easily married to our current analytics and reporting platforms, we needed a way to provide a robust set of tools to process and access the data.

With respect to the requirement of bulk processing, there are a couple very basic problems that we need to avoid when acquiring data. First, we don’t want to impact production systems. Second, Netflix is creating an internal infrastructure of decoupled applications, several of which are backed by their own Cassandra cluster. Data Science and Engineering needs to be able to obtain a consistent view of the data across the various clusters.

With these needs in mind and many of our key data sources rapidly migrating from traditional relational database systems into Cassandra, we set out to design a process to extract data from Cassandra and make it available in a generic form that is easily consumable by our bulk analytics platform. Since our desire is to retrieve the data in bulk, we rejected any attempts to query the production clusters directly. While Cassandra is very efficient at serving point queries and we have a lot of great APIs for accessing data here at Netflix, trying to ask a system for all of its data is generally not good for its long or short-term health.

Instead, we wanted to build an offline batch process for extracting the data. A big advantage to hosting our infrastructure on AWS is that we have access to effectively infinite, shared storage on S3. Our production Cassandra clusters are continuously backed up into an S3 bucket using our backup and recovery tool, Priam. Initially we intended to simply bring up a copy of each production cluster from Priam's backups and extract the data via a Hadoop map-reduce job running against the restored Cassandra cluster. Working forward from that approach, we soon discovered that while it may be feasible for one or two clusters, maintaining the number of moving parts required to deploy this solution to all of our production clusters was going to quickly become unmaintainable. It just didn't scale.

So, is there a better way to do it?

Taking a step back, it became evident to us that the problem of achieving scale in this architecture was two-fold. First, the overhead of spinning up a new cluster in AWS and restoring it from a backup did not scale well with the number of clusters being pushed to production. Second, we were operating under the constraint that backups have to be restored into a cluster equal in size to production. As data sizes grow, there is not necessarily any motivation for production data sources to increase the number of nodes in their clusters (remember, they are not bulk querying the data – their workloads don't scale linearly with respect to data size).

Thus, we were unable to leverage a key benefit of processing data on Hadoop – the ability to easily scale computing resources horizontally with respect to the size of your data.

We realized that Hadoop was an excellent fit for processing the data. The problem was that the Cassandra data was stored in a format that was not natively readable (sstables). Standing up Cassandra clusters from backups was simply a way to circumvent that problem. Rather than try to avoid the real issue, we decided to attack it head-on.

Aegisthus is Born

The end result was an application consisting of a constantly running Hadoop cluster capable of processing sstables as they are created by any Cassandra data source in our architecture. We call it Aegisthus, named in honor of Aegisthus’s relationship with Cassandra in Greek mythology.

Running on a single Hadoop cluster gives us the advantage of being able to easily and elastically scale a single computing resource. We were able to reduce the number of moving parts in our architecture while vastly increasing the speed at which we could process Cassandra’s data.

How it Works

A single map-reduce job is responsible for the bulk of the processing Aegisthus performs. The inputs to this map reduce job are sstables from Cassandra – either a full snapshot of a cluster (backup), or batches of sstables as they are incrementally archived by Priam from the Cassandra cluster. We process the sstables, reduce them into a single consistent view of the data, and convert it to JSON formatted rows that we stage out to S3 to be picked up by the rest of our analytics pipeline.

A full snapshot of a Cassandra cluster consists of all the sstables required to reconstitute the data into a new cluster that is consistent to the point at which the snapshot was taken. We developed an input format that is able to split the sstables across the entire Hadoop cluster, allowing us to control the amount of compute power we want to throw at the processing (horizontal scale). This was a welcome relief after trying to deal with timeout exceptions when directly using Cassandra as a data source for Hadoop input.

Each row-column value written to Cassandra is replicated and stored in sstables with a corresponding timestamp. The map phase of our job reads the sstables and converts them into JSON format. The reduce phase replicates the internal logic Cassandra uses to return data when it is queried with a consistency level of ALL (i.e. it reduces each row-column value based on the max timestamp).
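A simplified sketch of that reduce logic (not the actual Aegisthus code): for each row-column, keep the version with the highest timestamp, which mirrors how Cassandra resolves a read at consistency level ALL.

// Simplified sketch of the reduce step: pick the latest version of each column value.
public class ColumnVersion {
    final byte[] value;
    final long timestamp;
    final boolean deleted;   // tombstones carry timestamps too

    ColumnVersion(byte[] value, long timestamp, boolean deleted) {
        this.value = value;
        this.timestamp = timestamp;
        this.deleted = deleted;
    }

    /** Returns the winning version among the copies of the same row-column across sstables. */
    static ColumnVersion reduce(Iterable<ColumnVersion> versions) {
        ColumnVersion winner = null;
        for (ColumnVersion v : versions) {
            if (winner == null || v.timestamp > winner.timestamp) {
                winner = v;
            }
        }
        return winner;   // callers drop the column from the output if winner.deleted is true
    }
}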

Performance

Aegisthus is currently running on a Hadoop cluster consisting of 24 m2.4xlarge EC2 instances. In the table below, we show some benchmarks for a subset of the Cassandra clusters from which we are pulling data. The table shows the number of nodes in the Cassandra cluster that houses the data, the average size of data per node, the total number of rows in the dataset, the time our map/reduce job takes to run, and the number of rows/sec we are able to process.


The number of rows/sec is highly variable across data sources. This is due to a number of reasons, notably the average size of the rows and the average number of times a row is replicated across the sstables. Further, as can be seen in the last entry in the table, smaller datasets incur a noticeable penalty due to overhead in the map/reduce framework.

We’re constantly optimizing our process and have tons of other interesting and challenging problems to solve. Like what you see? We’re hiring!

Fault Tolerance in a High Volume, Distributed System

by Ben Christensen

In an earlier post by Ben Schmaus, we shared the principles behind our circuit-breaker implementation. In that post, Ben discusses how the Netflix API interacts with dozens of systems in our service-oriented architecture, which makes the API inherently more vulnerable to any system failures or latencies underneath it in the stack. The rest of this post provides a more technical deep-dive into how our API and other systems isolate failure, shed load and remain resilient to failures.

Fault Tolerance is a Requirement, Not a Feature

The Netflix API receives more than 1 billion incoming calls per day which in turn fans out to several billion outgoing calls (averaging a ratio of 1:6) to dozens of underlying subsystems with peaks of over 100k dependency requests per second.




This all occurs in the cloud across thousands of EC2 instances.

Intermittent failure is guaranteed with this many variables, even if every dependency itself has excellent availability and uptime.

Without taking steps to ensure fault tolerance, 30 dependencies each with 99.99% uptime would result in 2+ hours downtime/month (0.9999^30 = 99.7% uptime = 2+ hours in a month).

When a single API dependency fails at high volume with increased latency (causing blocked request threads) it can rapidly (seconds or sub-second) saturate all available Tomcat (or other container such as Jetty) request threads and take down the entire API.


Thus, it is a requirement of high volume, high availability applications to build fault tolerance into their architecture and not expect infrastructure to solve it for them.

Netflix DependencyCommand Implementation

The service-oriented architecture at Netflix allows each team freedom to choose the best transport protocols and formats (XML, JSON, Thrift, Protocol Buffers, etc) for their needs so these approaches may vary across services.

In most cases the team providing a service also distributes a Java client library.

Because of this, applications such as API in effect treat the underlying dependencies as 3rd party client libraries whose implementations are "black boxes". This in turn affects how fault tolerance is achieved.

In light of the above architectural considerations we chose to implement a solution that uses a combination of fault tolerance approaches:
  • network timeouts and retries
  • separate threads on per-dependency thread pools
  • semaphores (via a tryAcquire, not a blocking call)
  • circuit breakers
Each of these approaches to fault-tolerance has pros and cons but when combined together provide a comprehensive protective barrier between user requests and underlying dependencies.


The Netflix DependencyCommand implementation wraps a network-bound dependency call with a preference towards executing in a separate thread and defines fallback logic which gets executed (step 8 in flow chart below) for any failure or rejection (steps 3, 4, 5a, 6b below) regardless of which type of fault tolerance (network or thread timeout, thread pool or semaphore rejection, circuit breaker) triggered it.


We decided that the benefits of isolating dependency calls into separate threads outweigh the drawbacks (in most cases). Also, since the API is progressively moving towards increased concurrency, it was a win-win to achieve both fault tolerance and performance gains through concurrency with the same solution. In other words, the overhead of separate threads is being turned into a positive in many use cases by leveraging the concurrency to execute calls in parallel and speed up delivery of the Netflix experience to users.
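A stripped-down sketch of the pattern (illustrative only, not the actual DependencyCommand code): run the dependency call on a bounded, per-dependency thread pool with a timeout, and fall back on any failure or rejection.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of a DependencyCommand-style wrapper; not the actual implementation.
public abstract class SimpleDependencyCommand<R> {
    private final ExecutorService pool;      // bounded pool dedicated to this dependency
    private final long timeoutMillis;

    protected SimpleDependencyCommand(ExecutorService pool, long timeoutMillis) {
        this.pool = pool;
        this.timeoutMillis = timeoutMillis;
    }

    /** The network-bound call to the dependency. */
    protected abstract R run() throws Exception;

    /** Fallback executed on the calling thread; must not make network calls. */
    protected abstract R getFallback();

    public R execute() {
        Future<R> future = null;
        try {
            future = pool.submit(this::run);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            // Timeout, thread-pool rejection, or dependency failure: shed load and fall back.
            if (future != null) {
                future.cancel(true);
            }
            return getFallback();
        }
    }
}

Because the pool is bounded per dependency, a latent dependency can only exhaust its own threads; the calling container threads either get a fast answer or a fast fallback.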

Thus, most dependency calls now route through a separate thread-pool as the following diagram illustrates:


If a dependency becomes latent (the worst-case type of failure for a subsystem) it can saturate all of the threads in its own thread pool, but Tomcat request threads will timeout or be rejected immediately rather than blocking.


In addition to the isolation benefits and concurrent execution of dependency calls we have also leveraged the separate threads to enable request collapsing (automatic batching) to increase overall efficiency and reduce user request latencies.

Semaphores are used instead of threads for dependency executions known to not perform network calls (such as those only doing in-memory cache lookups) since the overhead of a separate thread is too high for these types of operations.

We also use semaphores to protect against non-trusted fallbacks. Each DependencyCommand is able to define a fallback function (discussed more below) which is performed on the calling user thread and should not perform network calls. Instead of trusting that all implementations will correctly abide by this contract, the fallback too is protected by a semaphore, so that if an implementation does involve a network call and becomes latent, the fallback itself won't be able to take down the entire app: it is limited in how many threads it can block.

Despite the use of separate threads with timeouts, we continue to aggressively set timeouts and retries at the network level (through interaction with client library owners, monitoring, audits etc).

The timeouts at the DependencyCommand threading level are the first line of defense regardless of how the underlying dependency client is configured or behaving but the network timeouts are still important otherwise highly latent network calls could fill the dependency thread-pool indefinitely.

The tripping of circuits kicks in when a DependencyCommand has passed a certain threshold of error (such as 50% error rate in a 10 second period) and will then reject all requests until health checks succeed.

This is used primarily to release the pressure on underlying systems (i.e. shed load) when they are having issues and reduce the user request latency by failing fast (or returning a fallback) when we know it is likely to fail instead of making every user request wait for the timeout to occur.
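A rough sketch of that circuit logic (illustrative; the real implementation tracks a rolling statistical window and richer health checks):

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative circuit breaker; the real implementation uses a rolling 10-second window.
public class SimpleCircuitBreaker {
    private final int errorPercentThreshold;      // e.g. 50
    private final long sleepWindowMillis;         // how long to stay open before re-testing
    private final AtomicInteger requests = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();
    private volatile long openedAt = -1;

    public SimpleCircuitBreaker(int errorPercentThreshold, long sleepWindowMillis) {
        this.errorPercentThreshold = errorPercentThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    public boolean allowRequest() {
        if (openedAt < 0) {
            return true;                                       // circuit closed: allow traffic
        }
        // After the sleep window, let a single health-check request through.
        return System.currentTimeMillis() - openedAt > sleepWindowMillis;
    }

    public void markSuccess() {
        openedAt = -1;                                         // close the circuit again
        requests.incrementAndGet();
    }

    public void markFailure() {
        int total = requests.incrementAndGet();
        int failed = failures.incrementAndGet();
        // Trip the circuit once enough traffic has been seen and the error rate is too high.
        if (total >= 20 && (failed * 100) / total >= errorPercentThreshold) {
            openedAt = System.currentTimeMillis();
        }
    }
}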

How do we respond to a user request when failure occurs?

In each of the options described above a timeout, thread-pool or semaphore rejection, or short-circuit will result in a request not retrieving the optimal response for our customers.

An immediate failure ("fail fast") throws an exception which causes the app to shed load until the dependency returns to health. This is preferable to requests "piling up" as it keeps Tomcat request threads available to serve requests from healthy dependencies and enables rapid recovery once failed dependencies recover.

However, there are often several preferable options for providing responses in a "fallback mode" to reduce impact of failure on users. Regardless of what causes a failure and how it is intercepted (timeout, rejection, short-circuited etc) the request will always pass through the fallback logic (step 8 in flow chart above) before returning to the user to give a DependencyCommand the opportunity to do something other than "fail fast".

Some approaches to fallbacks we use are, in order of their impact on the user experience:
  • Cache: Retrieve data from local or remote caches if the realtime dependency is unavailable, even if the data ends up being stale
  • Eventual Consistency: Queue writes (such as in SQS) to be persisted once the dependency is available again
  • Stubbed Data: Revert to default values when personalized options can't be retrieved
  • Empty Response ("Fail Silent"): Return a null or empty list which UIs can then ignore
All of this work is to maintain maximum uptime for our users while maintaining the maximum number of features for them to enjoy the richest Netflix experience possible. As a result, our goal is to have the fallbacks deliver responses as close as possible to what the actual dependency would deliver.

Example Use Case

Following is an example of how threads, network timeouts and retries combine:



The above diagram shows an example configuration where the dependency has no reason to hit the 99.5th percentile and thus cuts it short at the network timeout layer and immediately retries with the expectation to get median latency most of the time, and accomplish this all within the 300ms thread timeout.

If the dependency has legitimate reasons to sometimes hit the 99.5th percentile (i.e. cache miss with lazy generation) then the network timeout will be set higher than it, such as at 325ms with 0 or 1 retries and the thread timeout set higher (350ms+).

The threadpool is sized at 10 to handle a burst of 99th percentile requests, but when everything is healthy this threadpool will typically only have 1 or 2 threads active at any given time to serve mostly 40ms median calls.

When configured correctly a timeout at the DependencyCommand layer should be rare, but the protection is there in case something other than network latency affects the time, or the combination of connect+read+retry+connect+read in a worst case scenario still exceeds the configured overall timeout.

The aggressiveness of configurations and tradeoffs in each direction are different for each dependency.

Configurations can be changed in realtime as needed as performance characteristics change or when problems are found all without risking the taking down of the entire app if problems or misconfigurations occur.

Conclusion

The approaches discussed in this post have had a dramatic effect on our ability to tolerate and be resilient to system, infrastructure and application level failures without impacting (or limiting impact to) user experience.

Despite the success of this new DependencyCommand resiliency system over the past 8 months, there is still a lot for us to do in improving our fault tolerance strategies and performance, especially as we continue to add functionality, devices, customers and international markets.

If these kinds of challenges interest you, the API team is actively hiring:



JMeter Plugin for Cassandra


By Vijay Parthasarathy and Denis Sheahan

A number of previous blogs have discussed our adoption of Cassandra as a NoSQL solution in the cloud. We now have over 55 Cassandra clusters in the cloud and are moving our source of truth from our Datacenter to these Cassandra clusters. As part of this move we have not only contributed to Cassandra itself but developed software to ease its deployment and use. It is our plan to open source as much of this software as possible.

We recently announced the open sourcing of Priam, which is a co-process that runs alongside Cassandra on every node to provide backup and recovery, bootstrapping, token assignment, configuration management and a RESTful interface to monitoring and metrics. In January we also announced our Cassandra Java client Astyanax which is built on top of Thrift and provides lower latency, reduced latency variance, and better error handling.

At Netflix we have recently started to standardize our load testing across the fleet using Apache JMeter. As Cassandra is a key part of our infrastructure that needs to be tested we developed a JMeter plugin for Cassandra. In this blog we discuss the plugin and present performance data for Astyanax vs Thrift collected using this plugin.

Cassandra JMeter Plugin

JMeter allows us to customize our test cases based on our application logic/datamodel. The Cassandra JMeter plugin we are releasing today is described on the github wiki here. It consists of a jar file that is placed in JMeter's lib/ext directory. The instructions to build and install the jar file are here.

An example screenshot is shown below.


Benchmark Setup

We set up a simple 6-node Cassandra cluster using EC2 m2.4xlarge instances, and the following schema

create keyspace MemberKeySp
with placement_strategy = 'NetworkTopologyStrategy'
and strategy_options = [{us-east : 3}]
and durable_writes = true;
use MemberKeySp;
create column family Customer
with column_type = 'Standard'
and comparator = 'UTF8Type'
and default_validation_class = 'BytesType'
and key_validation_class = 'UTF8Type'
and rows_cached = 0.0
and keys_cached = 100000.0
and read_repair_chance = 0.0
and comment = 'Customer Records';

Six million rows were then inserted into the cluster with a replication factor of 3. Each row has 19 columns of simple ASCII data. The total data set is 2.9 GB per node, so it is easily cacheable in our instances, which have 68 GB of memory. We wanted to test the latency of the client implementation using a single Get Range Slice operation, i.e., 100% read-only. Each test was run twice to ensure the data was indeed cached, confirmed with iostat. One hundred JMeter threads were used to apply the load, with 100 connections from JMeter to each node of Cassandra. Each JMeter thread therefore has at least 6 connections to choose from when sending its request to Cassandra.

Every Cassandra JMeter Thread Group has a Config Element called CassandraProperties which contains clientType amongst other properties. For Astyanax, clientType is set to com.netflix.jmeter.connections.a6x.AstyanaxConnection; for Thrift, com.netflix.jmeter.connections.thrift.ThriftConnection.

Token Aware is the default JMeter setting. If you wish to experiment with other settings create a properties file, cassandra.properties, in the JMeter home directory with properties from the list below.

astyanax.connection.discovery=
astyanax.connection.pool=
astyanax.connection.latency.stategy=

Results

Transaction throughput

This graph shows the throughput at 5 second intervals for the Token Aware client vs the Thrift client. Token Aware is consistently higher than Thrift, and its average throughput is 3% better.

Average Latency

JMeter reports response times with millisecond granularity. The Token Aware implementation responds in 2ms the majority of the time with occasional 3ms periods; the average is 2.29ms. The Thrift implementation is consistently at 3ms. So Astyanax has about a 30% better response time than the raw Thrift implementation without a token-aware connection pool.

The plugin provides a wide range of samplers for Put, Composite Put, Batch Put, Get, Composite Get, Range Get and Delete. The github wiki has examples for all these scenarios including jmx files to try. Usually we develop the test scenario using the GUI on our laptops and then deploy to the cloud for load testing using the non-GUI version. We often deploy on a number of drivers in order to apply the required level of load.

The data for the above benchmark was also collected using a tool called casstat which we are also making available in the repository. Casstat is a bash script that calls other tools at regular intervals, compares the data with its previous sample, normalizes it on a per second basis and displays the pertinent data on a single line. Under the covers casstat uses

  • Cassandra nodetool cfstats to get Column Family performance data
  • nodetool tpstats to get internal state changes
  • nodetool cfhistograms to get 95th and 99th percentile response times
  • nodetool compactionstats to get details on number and type of compactions
  • iostat to get disk and cpu performance data
  • ifconfig to calculate network bandwidth

An example output is below (note some fields have been removed and abbreviated to reduce the width)

Epoch Rds/s RdLat ... %user %sys %idle .... md0r/s w/s rMB/s wMB/s NetRxK NetTxK Percentile Read Write Compacts
133... 5657 0.085 ... 7.74 10.09 81.73 ... 0.00 2.00 0.00 0.05 9083 63414 99th 0.179 ms 95th 0.14 ms 99th 0.00 ms 95th 0.00 ms Pen/0
133... 5635 0.083 ... 7.65 10.12 81.79 ... 0.00 0.30 0.00 0.00 9014 62777 99th 0.179 ms 95th 0.14 ms 99th 0.00 ms 95th 0.00 ms Pen/0
133... 5615 0.085 ... 7.81 10.19 81.54 ... 0.00 0.60 0.00 0.00 9003 62974 99th 0.179 ms 95th 0.14 ms 99th 0.00 ms 95th 0.00 ms Pen/0
We merge the casstat data from each Cassandra node and then use gnuplot to plot throughput etc.

The Cassandra JMeter plugin has become a key part of our load testing environment. We hope the wider community also finds it useful.



Testing Netflix on Android


When Netflix decided to enter the Android ecosystem, we faced a daunting set of challenges: a) we wanted to release rapidly, every 6-8 weeks; b) there were hundreds of Android devices of different shapes, versions, capacities and specifications which needed to play back audio and video; and c) we wanted to keep the team small and happy.

Of course, the seasoned tester in you has to admit that these are the sort of problems you like to wake up to every day and solve. Doing it with a group of other software engineers who are passionate about quality is what made overcoming those challenges even more fun.

Release rapidly

You probably guessed that automation had to play a role in this solution. However, automating scenarios on a phone or tablet is complicated when the core functionality of your application is native video playback but the user interface is HTML5 living in the application's web view.

Verifying an app that uses an embedded web view to serve as its presentation platform was challenging in part due to the dearth of tools available. We considered Selenium, AndroidNativeDriver and the Android Instrumentation Framework. Unfortunately, we could not use Selenium or the AndroidNativeDriver, because the bulk of our user interactions occur on the HTML5 front end.  As a result, we decided to build a slightly modified solution.

Our modified test framework heavily leverages a piece of our product code which bridges JavaScript and native code through a proxy interface.  Though we were able to drive some behavior by sending commands through the bridge, we needed an automation hook in order to report state back to the automation framework. Since the web view does not otherwise expose the state of the HTML document to native code, we decided to use the title element as our hook.  We rely on the onReceivedTitle notification as a way to communicate back to our Java code when some JavaScript is executed in the HTML5 UI. Through this approach, we were able to execute a variety of tasks by injecting JavaScript into the web view, performing the appropriate DOM inspection task, and then reporting the result through the title property.
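A minimal sketch of that title-based hook is shown below. This is not our actual framework; the injected JavaScript, the result-prefix convention and the class names are illustrative assumptions.

import android.webkit.WebChromeClient;
import android.webkit.WebView;

public class TitleHookClient extends WebChromeClient {
    private static final String RESULT_PREFIX = "AUTOMATION:";  // hypothetical marker

    @Override
    public void onReceivedTitle(WebView view, String title) {
        if (title != null && title.startsWith(RESULT_PREFIX)) {
            // Hand the DOM-inspection result back to the test framework.
            reportToAutomation(title.substring(RESULT_PREFIX.length()));
        }
    }

    // Inject JavaScript that inspects the DOM and publishes the result via document.title.
    static void queryFocusedElement(WebView webView) {
        webView.loadUrl("javascript:document.title='" + RESULT_PREFIX
                + "' + document.activeElement.id;");
    }

    private void reportToAutomation(String result) { /* forward to the test harness */ }
}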

With this solution in place, we are able to automate all our key scenarios such as login, browsing the movie catalog, searching and controlling movie playback.

While we automate the testing of playback, the subjective analysis of quality is still left to the tester. Using automation we can catch buffering and other streaming issues by adding testability to our software, but at the end of the day we need testers to verify issues such as seamless resolution switching or HD quality, which are hard, and cost prohibitive, to verify with automation today.

We have a continuous build integration system that allows us to run our automated smoke tests on each submit on a bank of devices.  With the framework in place, we are able to quickly ascertain build stability across the vast array of makes and models that are part of the Android ecosystem.  This quick and inexpensive feedback loop enables a very short release cycle by keeping the testing overhead of each release low.

Device Diversity
To put device diversity in context, we see around 1,000 different devices streaming Netflix on Android every day. We had to figure out how to group these devices into buckets so that we can be reasonably sure we are releasing something that will work properly on them. The devices we choose to participate in our continuous integration system are based on the following criteria.
  • We have at least one device for each playback pipeline architecture we support (The app uses several approaches for video playback on Android such as hardware decoder, software decoder, OMX-AL, iOMX).
  • We choose devices with high and low end processors as well as devices with different memory capabilities.
  • We have representatives covering each major operating system version and manufacturer, in addition to devices running custom ROMs (most notably CM7 and CM9).
  • We choose devices that are most heavily used by Netflix Subscribers.


With this information, we took stock of all the devices we have in house and classified them based on their specs. We figured out the optimal combination of devices to give us maximum coverage. We are able to reduce our daily smoke automation devices to around 10 phones and 4 tablets and keep the rest for the longer release-wide test cycles.

This list gets updated periodically to adjust to changing market conditions. Also note that this is only the phone list; we have a separate list for tablets. We have several other phones that we test using automation and a smaller set of high-priority tests, while the devices above go through the comprehensive suite of manual and automated testing.

To put it another way, when it comes to watching Netflix, any device other than those ten can be classified with one of the high-priority devices based on its configuration. This in turn helps us to quickly identify the class of problems associated with a given device.

Small Happy Team
We keep our team lean by focusing our full-time employees on building solutions that scale, and automation is a key part of this effort. When we do an international launch, we rely on crowd-sourced testing solutions like uTest to quickly verify network and latency performance.  This provides us real-world insurance that all of our backend systems are working as expected. These approaches give our team time to watch their favorite movies to ensure that we have the best mobile streaming video solution in the industry.

In a future post, we will discuss our iOS test process which provides its own unique set of technical challenges.

Amol Kher is the Engineering Manager in Tools for the Android, iOS and AppleTV teams. If you are interested in joining Netflix or the Mobile team, apply at www.netflix.com/jobs.

Netflix Recommendations: Beyond the 5 stars (Part 1)

by Xavier Amatriain and Justin Basilico (Personalization Science and Engineering)

In this two-part blog post, we will open the doors of one of the most valued Netflix assets: our recommendation system. In Part 1, we will relate the Netflix Prize to the broader recommendation challenge, outline the external components of our personalized service, and highlight how our task has evolved with the business. In Part 2, we will describe some of the data and models that we use and discuss our approach to algorithmic innovation that combines offline machine learning experimentation with online AB testing. Enjoy... and remember that we are always looking for more star talent to add to our great team, so please take a look at our jobs page.

The Netflix Prize and the Recommendation Problem

In 2006 we announced the Netflix Prize, a machine learning and data mining competition for movie rating prediction. We offered $1 million to whoever improved the accuracy of our existing system called Cinematch by 10%. We conducted this competition to find new ways to improve the recommendations we provide to our members, which is a key part of our business. However, we had to come up with a proxy question that was easier to evaluate and quantify: the root mean squared error (RMSE) of the predicted rating. The race was on to beat our RMSE of 0.9525 with the finish line of reducing it to 0.8572 or less.
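For reference, the metric is the standard root mean squared error over the N predicted ratings, and the 10% target follows directly from it (a quick restatement, with r_hat the predicted and r the actual rating):

RMSE = sqrt( (1/N) * Σ (r_hat_i - r_i)^2 )
target: 0.9525 × (1 - 0.10) ≈ 0.8572
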
A year into the competition, the Korbell team won the first Progress Prize with an 8.43% improvement. They reported more than 2000 hours of work in order to come up with the final combination of 107 algorithms that gave them this prize. And, they gave us the source code. We looked at the two underlying algorithms with the best performance in the ensemble: Matrix Factorization (which the community generally called SVD, Singular Value Decomposition) and Restricted Boltzmann Machines (RBM). SVD by itself provided a 0.8914 RMSE, while RBM alone provided a competitive but slightly worse 0.8990 RMSE. A linear blend of these two reduced the error to 0.88. To put these algorithms to use, we had to work to overcome some limitations, for instance that they were built to handle 100 million ratings, instead of the more than 5 billion that we have, and that they were not built to adapt as members added more ratings. But once we overcame those challenges, we put the two algorithms into production, where they are still used as part of our recommendation engine.

If you followed the Prize competition, you might be wondering what happened with the final Grand Prize ensemble that won the $1M two years later. This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then. In the remainder of this post we will explain how and why it has shifted.

From US DVDs to Global Streaming

One of the reasons our focus in the recommendation algorithms has changed is because Netflix as a whole has changed dramatically in the last few years. Netflix launched an instant streaming service in 2007, one year after the Netflix Prize began. Streaming has not only changed the way our members interact with the service, but also the type of data available to use in our algorithms. For DVDs our goal is to help people fill their queue with titles to receive in the mail over the coming days and weeks; selection is distant in time from viewing, people select carefully because exchanging a DVD for another takes more than a day, and we get no feedback during viewing. For streaming members are looking for something great to watch right now; they can sample a few videos before settling on one, they can consume several in one session, and we can observe viewing statistics such as whether a video was watched fully or only partially.
Another big change was the move from a single website into hundreds of devices. The integrations with the Roku player and the Xbox were announced in 2008, two years into the Netflix competition. Just a year later, Netflix streaming made it onto the iPhone. Now it is available on a multitude of devices that go from a myriad of Android devices to the latest AppleTV.
Two years ago, we went international with the launch in Canada. In 2011, we added 43 Latin-American countries and territories to the list. And just recently, we launched in the UK and Ireland. Today, Netflix has more than 23 million subscribers in 47 countries. Those subscribers streamed 2 billion hours from hundreds of different devices in the last quarter of 2011. Every day they add 2 million movies and TV shows to the queue and generate 4 million ratings.
We have adapted our personalization algorithms to this new scenario in such a way that now 75% of what people watch is from some sort of recommendation. We reached this point by continuously optimizing the member experience and have measured significant gains in member satisfaction whenever we improved the personalization for our members. Let us now walk you through some of the techniques and approaches that we use to produce these recommendations.

Everything is a Recommendation

We have discovered through the years that there is tremendous value to our subscribers in incorporating recommendations to personalize as much of Netflix as possible. Personalization starts on our homepage, which consists of groups of videos arranged in horizontal rows. Each row has a title that conveys the intended meaningful connection between the videos in that group. Most of our personalization is based on the way we select rows, how we determine what items to include in them, and in what order to place those items.
Take as a first example the Top 10 row: this is our best guess at the ten titles you are most likely to enjoy. Of course, when we say “you”, we really mean everyone in your household. It is important to keep in mind that Netflix’ personalization is intended to handle a household that is likely to have different people with different tastes. That is why when you see your Top10, you are likely to discover items for dad, mom, the kids, or the whole family. Even for a single person household we want to appeal to your range of interests and moods. To achieve this, in many parts of our system we are not only optimizing for accuracy, but also for diversity.
Another important element in Netflix’ personalization is awareness. We want members to be aware of how we are adapting to their tastes. This not only promotes trust in the system, but encourages members to give feedback that will result in better recommendations. A different way of promoting trust with the personalization component is to provide explanations as to why we decide to recommend a given movie or show. We are not recommending it because it suits our business needs, but because it matches the information we have from you: your explicit taste preferences and ratings, your viewing history, or even your friends’ recommendations.

On the topic of friends, we recently released our Facebook connect feature in 46 of the 47 countries where we operate – all but the US, because of concerns with the VPPA law. Knowing about your friends not only gives us another signal to use in our personalization algorithms, but it also allows for different rows that rely mostly on your social circle to generate recommendations.


Some of the most recognizable personalization in our service is the collection of “genre” rows. These range from familiar high-level categories like "Comedies" and "Dramas" to highly tailored slices such as "Imaginative Time Travel Movies from the 1980s". Each row represents 3 layers of personalization: the choice of genre itself, the subset of titles selected within that genre, and the ranking of those titles. Members connect with these rows so well that we measure an increase in member retention by placing the most tailored rows higher on the page instead of lower. As with other personalization elements, freshness and diversity is taken into account when deciding what genres to show from the thousands possible.

We present an explanation for the choice of rows using a member’s implicit genre preferences (recent plays, ratings, and other interactions) or explicit feedback provided through our taste preferences survey. We will also invite members to focus a row with additional explicit preference feedback when this is lacking.


Similarity is also an important source of personalization in our service. We think of similarity in a very broad sense; it can be between movies or between members, and can be in multiple dimensions such as metadata, ratings, or viewing data. Furthermore, these similarities can be blended and used as features in other models. Similarity is used in multiple contexts, for example in response to a member's action such as searching or adding a title to the queue. It is also used to generate rows of “adhoc genres” based on similarity to titles that a member has interacted with recently. If you are interested in a more in-depth description of the architecture of the similarity system, you can read about it in this past post on the blog.

In most of the previous contexts – be it in the Top10 row, the genres, or the similars – ranking, the choice of what order to place the items in a row, is critical in providing an effective personalized experience. The goal of our ranking system is to find the best possible ordering of a set of items for a member, within a specific context, in real-time. We decompose ranking into scoring, sorting, and filtering sets of movies for presentation to a member. Our business objective is to maximize member satisfaction and month-to-month subscription retention, which correlates well with maximizing consumption of video content. We therefore optimize our algorithms to give the highest scores to titles that a member is most likely to play and enjoy.
Now it is clear that the Netflix Prize objective, accurate prediction of a movie's rating, is just one of the many components of an effective recommendation system that optimizes our members' enjoyment. We also need to take into account factors such as context, title popularity, interest, evidence, novelty, diversity, and freshness. Supporting all the different contexts in which we want to make recommendations requires a range of algorithms that are tuned to the needs of those contexts. In the next part of this post, we will talk in more detail about the ranking problem. We will also dive into the data and models that make all the above possible and discuss our approach to innovating in this space.



Introducing Exhibitor - A Supervisor System for Apache ZooKeeper


by Jordan Zimmerman

ZooKeeper
ZooKeeper is a high-performance coordination service for distributed applications. We've already open sourced a client library for ZooKeeper, Curator. In addition to Curator for clients, we found a need for a supervisor service that runs alongside ZooKeeper server instances. Thus, we are introducing Exhibitor and open sourcing it.

ZooKeeper Administrative Issues
Managing a ZooKeeper cluster requires a lot of manual effort — see the ZooKeeper Administrator's Guide for details. In particular, ZooKeeper is statically configured. The instances that comprise a ZooKeeper ensemble must be hard coded into a configuration file that must be identical on each ZooKeeper instance. Once the ZooKeeper instances are started, it's not possible to reconfigure the ensemble without updating the configuration file and restarting the instances. If not properly done, ZooKeeper can lose quorum and clients can perceive the ensemble as being unavailable.

In addition to static configuration issues, ZooKeeper requires regular maintenance. When using ZooKeeper versions prior to 3.4.x you are advised to periodically clean up the ZooKeeper log files. Also, you are advised to have a monitor of some kind to assure that each ZooKeeper instance is up and serving requests.

Exhibitor Features
Exhibitor provides a number of features that make managing a ZooKeeper ensemble much easier:

  • Instance Monitoring: Each Exhibitor instance monitors the ZooKeeper server running on the same server. If ZooKeeper is not running (due to crash, etc.), Exhibitor will rewrite the zoo.cfg file and restart it.
  • Log Cleanup: In versions prior to ZooKeeper 3.4.x log file maintenance is necessary. Exhibitor will periodically do this maintenance.
  • Backup/Restore: Exhibitor can periodically backup the ZooKeeper transaction files. Once backed up, you can index any of these transaction files. Once indexed, you can search for individual transactions and “replay” them to restore a given ZNode to ZooKeeper.
  • Cluster-wide Configuration: Exhibitor attempts to present a single console for your entire ZooKeeper ensemble. Configuration changes made in Exhibitor will be applied to the entire ensemble.
  • Rolling Ensemble Changes: Exhibitor can update the servers in the ensemble in a rolling fashion so that the ZooKeeper ensemble can stay up and in quorum while the changes are being made.
  • Visualizer: Exhibitor provides a graphical tree view of the ZooKeeper ZNode hierarchy.
  • Curator Integration: Exhibitor and Curator (Cur/Ex!) can be configured to work together so that Curator instances are updated for changes in the ensemble (a minimal connection sketch follows this list).
  • Rich REST API: Exhibitor exposes a REST API for programmatic integration.
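For context, a basic Curator client connection looks roughly like the sketch below; the Cur/Ex integration keeps the ensemble specification behind such a client current as Exhibitor changes the ensemble (the ensemble-provider wiring is omitted here, and the connection string and ZNode path are illustrative).

import com.netflix.curator.framework.CuratorFramework;
import com.netflix.curator.framework.CuratorFrameworkFactory;
import com.netflix.curator.retry.ExponentialBackoffRetry;

public class CuratorConnectExample {
    public static void main(String[] args) throws Exception {
        // Illustrative connection string; in a Cur/Ex setup Exhibitor supplies and refreshes this list.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();
        byte[] data = client.getData().forPath("/example/znode");
        System.out.println(new String(data));
        client.close();
    }
}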

Easy To Use GUI
Exhibitor is an easy to use web application with a modern UI:

Easy To Integrate
There are two versions of Exhibitor:

  • Standalone: The standalone version comes pre-configured as a Jetty-based self-contained application.
  • Core: The core version can be integrated into an existing application or you can build an extended application around it.

For More Information...

We're Hiring!
Like what you see? Netflix is a great place for programmers. Check out our Jobs Board today.

 

Netflix Joins World IPv6 Launch



by Josh Evans

As you may know, the world is quickly running out of available IPv4 addresses. Some basic math highlights why. IPv4 addresses are limited to a 32-bit address space. This means that there are roughly 4 billion possible addresses on the internet. To put this into perspective, there were roughly 7 billion people on the planet in 2011. This equates to less than one IP address per person, which is not sufficient for the needs of a global internet community and supporting infrastructure.

Enter IPv6, the next-generation Internet protocol, which has a 128-bit address space. This equates to ~4.8×10^28 addresses for each of those seven billion people alive in 2011. This is more than enough for you to have IP addresses for your computer, mobile phone, connected TV, refrigerator, toaster, and coffee maker with room to spare.
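The arithmetic behind that figure is straightforward:

2^128 ≈ 3.4 × 10^38 total IPv6 addresses
(3.4 × 10^38) / (7 × 10^9 people) ≈ 4.8 × 10^28 addresses per person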

In order to support the global expansion of the Netflix streaming service and ease the growing burden of IPv4 address exhaustion on ISPs, Netflix is proud to be participating in the World IPv6 Launch on June 6, 2012. This is an event in which “major Internet service providers (ISPs), home networking equipment manufacturers, and web companies around the world are coming together to permanently enable IPv6 for their products and services”.

For Netflix, our initial IPv6 deployment involves the Netflix website and video streaming on the PC and Mac platforms. We will follow on with other streaming platforms. There’s no action required for Netflix members. We’ll continue to support IPv4, and IPv6 will simply work when your ISP “lights up” support in their networks.

We will follow up with another post soon to share the technical details about our deployment. Stay tuned…

Josh

Josh Evans is director of streaming infrastructure, responsible for Netflix services which support streaming playback and device activation. If you are interested in joining Streaming Infrastructure or another team at Netflix, apply at www.netflix.com/jobs.

Netflix Operations: Part I, Going Distributed

Running the Netflix Cloud

Moving to the cloud presented new challenges for us[1] and forced us to develop new design patterns for running a reliable and resilient distributed system[2].  We’ve focused many of our past posts on the technical hurdles we overcame to run successfully in the cloud.  However, we had to make operational and organizational transformations as well.  We want to share the way we think about operations at Netflix to help others going through a similar journey.  In putting this post together, we realized there’s so much to share that we decided to make this a first in a series of posts on operations at Netflix.

The old guard

When we were running out of our data center, Netflix was a large monolithic Java application running inside of a Tomcat container.  Every two weeks, the deployment train left at exactly the same time and anyone wanting to deploy a production change needed to have their code checked in and tested before departure time.  This also meant that anyone could check in bad code and bring the entire train to a halt while the issue was diagnosed and resolved.  Deployments were heavy and risk-laden and, because of all the moving parts going into each one, they were handled by a centralized team that was part of ITOps.

Production support was similarly centralized within ITOps.  We had a traditional NOC that monitored charts and graphs and was called when a service interruption occurred.  They were organizationally separate from the development team.  More importantly, there was a large cultural divide between the operations and development teams because of the mismatched goals of site uptime versus features and velocity of innovation.

Built for scale in the cloud

In moving to the cloud, we saw an opportunity to recast the mold for how we build and deploy our software.  We used the cloud migration as an opportunity to re-architect our system into a service oriented architecture with hundreds of individual services.  Each service could be revved on its own deployment schedule, often weekly, empowering each team to deliver innovation at its own desired pace.  We unwound the centralized deployment team and distributed the function into the teams that owned each service.

Post-deployment support was similarly distributed as part of the cloud migration.  The NOC was disbanded and a new Site Reliability Engineering team was created within the development organization not to operate the system, but to provide system-wide analysis and development around reliability and resiliency.

The road to a distributed future

As the scale of web applications has grown over time due to the addition of features and growth of usage, the application architecture has changed radically.  There are a number of things that exemplify this: service oriented architecture, eventually consistent data stores, map-reduce, etc.  The fundamental thing that they all share is a distributed architecture that involves numerous applications, servers and interconnections.  For Netflix this meant moving from a few teams checking code into a large monolithic application running on tens of servers to having tens of engineering teams developing hundreds of component services that run on thousands of servers.


The Netflix distributed system
As all of these changes occurred on the engineering side, we had to modify the way that we think about and organize for operations as well.  Our approach has been to make operations itself a distributed system.  Each engineering team is responsible for coding, testing and operating its systems in the production environment.  The result is that each team develops the expertise in the operational areas it most needs and then we leverage that knowledge across the organization.  There is an argument that developers needing to fundamentally understand how to operate, monitor and improve the resiliency of their applications in production is a distraction from their “real” work.  However, our experience has been that the added ownership actually leads to more robust applications and greater agility than centralizing these efforts.  For example, we've found that making the developers responsible for fixing their own code at 4am has encouraged them to write more robust code that handles failure gracefully, as a way of avoiding getting another 4am call. Our developers get more work done more quickly than before.

As we grew, it quickly became clear that centralized operations was not well suited for our new use case.  Our production environment is too complex for any one team or organization to understand well, which meant that they were forced to either make ill-informed decisions based on their perceptions, or get caught in a game of telephone tag with different development teams.  We also didn’t want our engineering teams to be tightly coupled to each other when they were making changes to their applications.

In addition to distributing the operations experience throughout development, we also heavily invested in tools and automation.  We created a number of engineering teams to focus on high volume monitoring and event correlation, end-to-end continuous integration and builds, and automated deployment tools[3][4].  These tools are critical to limiting the amount of extra work developers must do in order to manage the environment while also providing them with the information that they need to make smart decisions about when and how they deploy, vet and diagnose issues with each new code deployment.

Conclusion

Our architecture and code base evolved and adapted to our new cloud-based environment.  Our operations have evolved as well.  Both aim to be distributed and scalable.  Once a centralized function, operations is now distributed throughout the development organization.  Successful operations at Netflix is a distributed system, much like our software, that relies on algorithms, tools, and automation to scale to meet the demands of our ever-growing user-base and product.

In future posts, we’ll explore the various aspects of operations at Netflix in more depth.  If you want to see any area explored in more detail, comment below or tweet us.



If you’re passionate about building and running massive-scale web applications, we’re always looking for amazing developers and site reliability engineers.  See all our open positions at www.netflix.com/jobs.

- Ariel Tseitlin (@atseitlin), Director of Cloud Solutions
- Greg Orzell (@chaossimia), Cloud & Platform Engineering Architect

[1] http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
[2] http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
[3] http://www.slideshare.net/joesondow/asgard-the-grails-app-that-deploys-netflix-to-the-cloud
[4] http://www.slideshare.net/carleq/building-cloudtoolsfornetflixcode-mash2012

Announcing Archaius: Dynamic Properties in the Cloud

By Allen Wang and Sudhir Tonse


Netflix has a culture of being dynamic when it comes to decision making. This trait comes across both in the business domain as well as in technology and operations.
It follows that we like the ability to effect changes in the behavior of our deployed services dynamically at run-time. Availability is of the utmost importance to us, so we would like to accomplish this without having to bounce servers.
Furthermore, we want the ability to dynamically change properties (and hence the logic and behavior of our services) based on a request or deployment context. For example, we want to configure properties for an application instance or request, based on factors like the Amazon Region the service is deployed in, the country of origin (of the request), the device the movie is playing on etc.

What is Archaius?

 








(Image obtained from http://en.wikipedia.org/wiki/File:Calumma_tigris-2.jpg)

Archaius is the dynamic, multi-dimensional properties framework that addresses these requirements and use cases.
The code name for the project comes from an endangered species of chameleon. More information can be found at http://en.wikipedia.org/wiki/Archaius_tigris. We chose Archaius because chameleons are known for changing their color (a property) based on their environment and situation.


We are pleased to announce the public availability of Archaius as an important milestone in our continued goal of open sourcing the Netflix Platform Stack. (Available at http://github.com/netflix)

Why Archaius?

To understand why we built Archaius, we need to enumerate the pain points of configuration management and the ecosystem that the system operates in. Some of these are captured below, and drove the requirements.
  • Static updates require server pushes; this was operationally undesirable and caused a dent in the availability of the service/application.
  • A Push method of updating properties could not be employed, as this system would need to know all the server instances to push the configuration to at any given point in time (i.e. the list of hostnames and property locations). 
    • This was a possibility in our own data center where we owned all the servers. In the cloud, the instances are ephemeral and their hostnames/IP addresses are not known in advance. Furthermore, the number of these instances fluctuates based on the ASG settings. (For more information on how Netflix uses the Auto Scaling Group feature of AWS, please visit here or here.)
  • Given that property changes had to be applied at run time, it was clear that the codebase had to use a common mechanism which allowed it to consume properties in a uniform manner, from different sources (both static and dynamic).
  • There was a need to have different properties for different applications and services under different contexts. See the section "Netflix Deployment Overview" for an overview of services and context.
  • Property changes needed to be journaled. This allowed us to correlate any issues in production to a corresponding run time property change.
  • Properties had to be applied based on the context, i.e. the property had to be multi-dimensional. At Netflix, the context was based on "dimensions" such as Environment (development, test, production), Deployed Region (us-east-1, us-west-1 etc.), "Stack" (a concept in which each app and the services in its dependency graph were isolated for a specific purpose; e.g. "iPhone App launch Stack") etc.

Use Cases/Examples

  • Enable or disable certain features based on the request context. 
  • A UI presentation logic layer may have a default configuration to display 10 Movie Box Shots in a single display row. If we determine that we would like to display 5 instead, we can do so using Archaius' Dynamic Properties (see the sketch after this list).
  • We can override the behaviors of the circuit breakers. Reference: Resiliency and Circuit breakers
  • Connection and request timeouts for calls to internal and external services can be adjusted as needed
  • In case we get alerted on errors observed in certain services, we can change the Log Levels (i.e. DEBUG, WARN etc.) dynamically for particular packages/components on these services. This enables us to parse the log files to inspect these errors. Once we are done inspecting the logs, we can reset the Log Levels using Dynamic Properties.
  • Now that Netflix is deployed in an ever growing global infrastructure, Dynamic Properties allow us to enable different characteristics and features based on the International market.
  • Certain infrastructural components benefit from having configurations changed at runtime based on aggregate site-wide behavior. For example, a distributed cache component's TTL (time to live) can be changed at runtime based on external factors.
  • Connection pools had to be set differently for the same client library based on which application/service it was deployed in. (For example, in a light weight, low Requests Per Second (RPS) application, the number of connections in a connection pool to a particular service/db will be set to a lower number compared to a high RPS application)
  • The changes in properties can be effected on a particular instance, a particular region, a stack of deployed services or an entire farm of a particular application at run-time.
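As an illustration of the Box Shots example above, a dynamic property can be consumed like this. The property key and default value are made up for the sketch; the value can then be changed at runtime without redeploying the application.

import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

public class RowPresenter {
    // Hypothetical property key and default value.
    private static final DynamicIntProperty boxShotsPerRow =
            DynamicPropertyFactory.getInstance().getIntProperty("ui.row.boxshots", 10);

    public void renderRow() {
        int count = boxShotsPerRow.get();  // always returns the current value
        // ... lay out 'count' box shots in the row ...
    }
}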

Netflix Deployment Overview

Example Deployment Context

  • Environment = TEST
  • Region = us-east-1
  • Stack = MyTestStack
  • AppName = cherry
The diagram above shows a hypothetical simplistic overview of a typical deployment architecture at Netflix. Netflix has several services and applications that are consumer facing. These are referred to as Edge Services/Applications. These are typically fronted by Amazon's ELB. Each application/service depends on a set of mid-tier services and persistence technologies (Amazon S3, Cassandra etc.) sometimes fronted by a distributed cache.

Every service or application has a unique "AppName" associated with it. Most services at Netflix are stateless and hosted on multiple instances deployed across multiple Availability Zones of an Amazon Region. The available environments could be "test" or "production" etc. A Stack is a logical grouping. For example, an Application and the Mid-Tier Services in its dependency graph can all be logically grouped as belonging to a Stack called "MyTestStack". This is typically done to run different tests on isolated and controlled deployments.

The red oval boxes in the diagram above, called "Shared Libraries", represent the various pieces of common code used by multiple applications. For example, Astyanax, our open sourced Cassandra client, is one such shared library. It turns out that we may need to configure the connection pool differently for each of the applications that use the Astyanax library. Furthermore, it could vary in different Amazon Regions and within different "Stacks" of deployments. Sometimes, we may want to tweak this connection pool parameter at runtime. These are the capabilities that Archaius offers.
i.e. the ability to specifically target a subset or an aggregation of components, with a view towards configuring their behavior statically (at initial loading) or at runtime, is what enables us to address the use cases outlined above.

The examples and diagrams in this article show a representative view of how Archaius is used at Netflix. Archaius, the Open sourced version of the project is configurable and extendable to meet your specific needs and deployment environment (even if your deployment of choice is not the EC2 Cloud).

Overview of Archaius


 
Archaius includes a set of Java configuration management APIs that are used at Netflix. It is primarily implemented as an extension of Apache's Commons Configuration library. Notable features are:
  • Dynamic, Typed Properties
  • High throughput and Thread Safe Configuration operations
  • A polling framework that allows for obtaining property changes from a Configuration Source
  • A Callback mechanism that gets invoked on effective/"winning" property mutations (in the ordered hierarchy of Configurations)
  • A JMX MBean that can be accessed via JConsole to inspect and invoke operations on properties
At the heart of Archaius is the concept of a Composite Configuration, which is an ordered list of one or more Configurations. Each Configuration can be sourced from a Configuration Source such as JDBC, a REST API, a .properties file etc. Configuration Sources can optionally be polled at runtime for changes (in the above diagram, the Persisted DB Configuration Source, which is an RDBMS containing properties in a table, is polled every so often for changes). The final value of a property is determined by the topmost Configuration that contains that property, i.e. if a property is present in multiple configurations, the actual value seen by the application will be the value present in the topmost slot in the hierarchy of Configurations. The order of the configurations in the hierarchy can be configured.
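The "topmost wins" behavior comes from the underlying Apache Commons Configuration library. A stripped-down illustration of the idea, using the Apache classes directly rather than Archaius's own composite type, with made-up property names and values:

import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.CompositeConfiguration;

public class TopmostWinsExample {
    public static void main(String[] args) {
        BaseConfiguration overrides = new BaseConfiguration();  // e.g. a polled dynamic source
        overrides.addProperty("movies.per.row", 5);

        BaseConfiguration defaults = new BaseConfiguration();   // e.g. a bundled .properties file
        defaults.addProperty("movies.per.row", 10);

        // Configurations added first sit higher in the hierarchy and win on conflicts.
        CompositeConfiguration composite = new CompositeConfiguration();
        composite.addConfiguration(overrides);
        composite.addConfiguration(defaults);

        System.out.println(composite.getInt("movies.per.row"));  // prints 5
    }
}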

A rough template for handling a request and using Dynamic Property based execution is shown below: 
 
void handleFeatureXYZRequest(Request... params) {
  // featureXYZDynamicProperty is an Archaius DynamicStringProperty; get() returns
  // the current value, which may have been changed at runtime.
  if (featureXYZDynamicProperty.get().equals("useLongDescription")) {
    showLongDescription();
  } else {
    showShortSnippet();
  }
}
The source code for Archaius is hosted on GitHub at https://github.com/Netflix/archaius.

References

  1. Apache's Commons Configuration library
  2. Archaius Features
  3. Archaius User Guide

Conclusion

Archaius forms an important component of the Netflix Cloud Platform. It offers the ability to control various sub systems and components at runtime without any impact to the availability of the services. We hope that this is a useful addition to the list of projects open sourced by Netflix, and invite the open source community to help us improve Archaius and other components.

Interested in helping us take Netflix Cloud Platform to the next level? We are looking for talented engineers.

- Allen Wang, Sr. Software Engineer, Cloud Platform (Core Infrastructure)
- Sudhir Tonse (@stonse), Manager, Cloud Platform (Core Infrastructure)



Netflix Recommendations: Beyond the 5 stars (Part 2)

by Xavier Amatriain and Justin Basilico (Personalization Science and Engineering)
 
In part one of this blog post, we detailed the different components of Netflix personalization. We also explained how Netflix personalization, and the service as a whole, have changed from the time we announced the Netflix Prize. The $1M Prize delivered a great return on investment for us, not only in algorithmic innovation, but also in brand awareness and attracting stars (no pun intended) to join our team. Predicting movie ratings accurately is just one aspect of our world-class recommender system. In this second part of the blog post, we will give more insight into our broader personalization technology. We will discuss some of our current models, data, and the approaches we follow to lead innovation and research in this space.

Ranking

The goal of recommender systems is to present a number of attractive items for a person to choose from. This is usually accomplished by selecting some items and sorting them in the order of expected enjoyment (or utility). Since the most common way of presenting recommended items is in some form of list, such as the various rows on Netflix, we need an appropriate ranking model that can use a wide variety of information to come up with an optimal ranking of the items for each of our members.

If you are looking for a ranking function that optimizes consumption, an obvious baseline is item popularity. The reason is clear: on average, a member is most likely to watch what most others are watching. However, popularity is the opposite of personalization: it will produce the same ordering of items for every member. Thus, the goal becomes to find a personalized ranking function that is better than item popularity, so we can better satisfy members with varying tastes.

Recall that our goal is to recommend the titles that each member is most likely to play and enjoy. One obvious way to approach this is to use the member's predicted rating of each item as an adjunct to item popularity. Using predicted ratings on their own as a ranking function can lead to items that are too niche or unfamiliar being recommended, and can exclude items that the member would want to watch even though they may not rate them highly. To compensate for this, rather than using either popularity or predicted rating on their own, we would like to produce rankings that balance both of these aspects. At this point, we are ready to build a ranking prediction model using these two features.

There are many ways one could construct a ranking function ranging from simple scoring methods, to pairwise preferences, to optimization over the entire ranking. For the purposes of illustration, let us start with a very simple scoring approach by choosing our ranking function to be a linear combination of popularity and predicted rating. This gives an equation of the form f_rank(u,v) = w1*p(v) + w2*r(u,v) + b, where u = user, v = video item, p = popularity and r = predicted rating. This equation defines a two-dimensional space like the one depicted below.

Once we have such a function, we can pass a set of videos through our function and sort them in descending order according to the score. You might be wondering how we can set the weights w1 and w2 in our model (the bias b is constant and thus ends up not affecting the final ordering). In other words, in our simple two-dimensional model, how do we determine whether popularity is more or less important than predicted rating? There are at least two possible approaches to this. You could sample the space of possible weights and let the members decide what makes sense after many A/B tests. This procedure might be time consuming and not very cost effective. Another possible answer involves formulating this as a machine learning problem: select positive and negative examples from your historical data and let a machine learning algorithm learn the weights that optimize your goal. This family of machine learning problems is known as "Learning to rank" and is central to application scenarios such as search engines or ad targeting. Note though that a crucial difference in the case of ranked recommendations is the importance of personalization: we do not expect a global notion of relevance, but rather look for ways of optimizing a personalized model.
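A toy sketch of that two-feature linear scorer and the sort step is shown below. This is an illustration of the equation above, not Netflix's production ranker; the weights and the Video type are assumptions.

import java.util.Comparator;
import java.util.List;

class LinearRanker {
    private final double w1;  // weight on popularity
    private final double w2;  // weight on predicted rating
    private final double b;   // bias; constant, so it never changes the ordering

    LinearRanker(double w1, double w2, double b) {
        this.w1 = w1; this.w2 = w2; this.b = b;
    }

    double score(double popularity, double predictedRating) {
        return w1 * popularity + w2 * predictedRating + b;
    }

    /** Sort candidate videos for one member, highest score first. */
    void rank(List<Video> candidates) {
        candidates.sort(Comparator.comparingDouble(
                (Video v) -> score(v.popularity, v.predictedRating)).reversed());
    }

    static class Video {
        double popularity;       // e.g. normalized play count
        double predictedRating;  // e.g. output of the rating-prediction model
    }
}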

As you might guess, apart from popularity and rating prediction, we have tried many other features at Netflix. Some have shown no positive effect while others have improved our ranking accuracy tremendously. The graph below shows the ranking improvement we have obtained by adding different features and optimizing the machine learning algorithm.


Many supervised classification methods can be used for ranking. Typical choices include Logistic Regression, Support Vector Machines, Neural Networks, or Decision Tree-based methods such as Gradient Boosted Decision Trees (GBDT). On the other hand, a great number of algorithms specifically designed for learning to rank have appeared in recent years such as RankSVM or RankBoost. There is no easy answer to choose which model will perform best in a given ranking problem. The simpler your feature space is, the simpler your model can be. But it is easy to get trapped in a situation where a new feature does not show value because the model cannot learn it. Or, the other way around, to conclude that a more powerful model is not useful simply because you don't have the feature space that exploits its benefits.

Data and Models

The previous discussion on the ranking algorithms highlights the importance of both data and models in creating an optimal personalized experience for our members. At Netflix, we are fortunate to have many relevant data sources and smart people who can select optimal algorithms to turn data into product features. Here are some of the data sources we can use to optimize our recommendations:
  • We have several billion item ratings from members. And we receive millions of new ratings a day.
  • We already mentioned item popularity as a baseline. But, there are many ways to compute popularity. We can compute it over various time ranges, for instance hourly, daily, or weekly. Or, we can group members by region or other similarity metrics and compute popularity within that group.
  • We receive several million stream plays each day, which include context such as duration, time of day and device type.
  • Our members add millions of items to their queues each day.
  • Each item in our catalog has rich metadata: actors, director, genre, parental rating, and reviews.
  • Presentations: We know what items we have recommended and where we have shown them, and can look at how that decision has affected the member's actions. We can also observe the member's interactions with the recommendations: scrolls, mouse-overs, clicks, or the time spent on a given page.
  • Social data has become our latest source of personalization features; we can process what connected friends have watched or rated.
  • Our members directly enter millions of search terms in the Netflix service each day.
  • All the data we have mentioned above comes from internal sources. We can also tap into external data to improve our features. For example, we can add external item data features such as box office performance or critic reviews.
  • Of course, that is not all: there are many other features such as demographics, location, language, or temporal data that can be used in our predictive models.
So, what about the models? One thing we have found at Netflix is that with the great availability of data, both in quantity and types, a thoughtful approach is required to model selection, training, and testing. We use all sorts of machine learning approaches: From unsupervised methods such as clustering algorithms to a number of supervised classifiers that have shown optimal results in various contexts. This is an incomplete list of methods you should probably know about if you are working in machine learning for personalization:
  • Linear regression
  • Logistic regression
  • Elastic nets
  • Singular Value Decomposition
  • Restricted Boltzmann Machines
  • Markov Chains
  • Latent Dirichlet Allocation
  • Association Rules
  • Gradient Boosted Decision Trees
  • Random Forests
  • Clustering techniques from the simple k-means to novel graphical approaches such as Affinity Propagation
  • Matrix factorization

Consumer Data Science

The abundance of source data, measurements and associated experiments allow us to operate a data-driven organization. Netflix has embedded this approach into its culture since the company was founded, and we have come to call it Consumer (Data) Science. Broadly speaking, the main goal of our Consumer Science approach is to innovate for members effectively. The only real failure is the failure to innovate; or as Thomas Watson Sr, founder of IBM, put it: “If you want to increase your success rate, double your failure rate.” We strive for an innovation culture that allows us to evaluate ideas rapidly, inexpensively, and objectively. And, once we test something we want to understand why it failed or succeeded. This lets us focus on the central goal of improving our service for our members.

So, how does this work in practice? It is a slight variation over the traditional scientific process called A/B testing (or bucket testing):

1. Start with a hypothesis
  • Algorithm/feature/design X will increase member engagement with our service and ultimately member retention
2. Design a test
  • Develop a solution or prototype. Ideal execution can be 2X as effective as a prototype, but not 10X.
  • Think about dependent & independent variables, control, significance…
3. Execute the test

4. Let data speak for itself

When we execute A/B tests, we track many different metrics. But we ultimately trust member engagement (e.g. hours of play) and retention. Tests usually have thousands of members and anywhere from 2 to 20 cells exploring variations of a base idea. We typically have scores of A/B tests running in parallel. A/B tests let us try radical ideas or test many approaches at the same time, but the key advantage is that they allow our decisions to be data-driven. You can read more about our approach to A/B Testing in this previous tech blog post or in some of the Quora answers by our Chief Product Officer Neil Hunt.

An interesting follow-up question that we have faced is how to integrate our machine learning approaches into this data-driven A/B test culture at Netflix. We have done this with an offline-online testing process that tries to combine the best of both worlds. The offline testing cycle is a step where we test and optimize our algorithms prior to performing online A/B testing. To measure model performance offline we track multiple metrics used in the machine learning community: from ranking measures such as normalized discounted cumulative gain, mean reciprocal rank, or fraction of concordant pairs, to classification metrics such as accuracy, precision, recall, or F-score. We also use the famous RMSE from the Netflix Prize or other more exotic metrics to track different aspects like diversity. We keep track of how well those metrics correlate to measurable online gains in our A/B tests. However, since the mapping is not perfect, offline performance is used only as an indication to make informed decisions on follow up tests.
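As one concrete example of those ranking measures, the fraction of concordant pairs can be computed along the lines below. This is a simplified sketch (ties in the ground truth are skipped), not our actual evaluation code.

class RankingMetrics {
    /** Fraction of item pairs whose predicted ordering agrees with the observed ordering. */
    static double fractionConcordantPairs(double[] observed, double[] predicted) {
        int concordant = 0, comparable = 0;
        for (int i = 0; i < observed.length; i++) {
            for (int j = i + 1; j < observed.length; j++) {
                if (observed[i] == observed[j]) continue;  // skip ties in the ground truth
                comparable++;
                boolean sameOrder = (observed[i] - observed[j]) * (predicted[i] - predicted[j]) > 0;
                if (sameOrder) concordant++;
            }
        }
        return comparable == 0 ? 0.0 : (double) concordant / comparable;
    }
}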

Once offline testing has validated a hypothesis, we are ready to design and launch the A/B test that will prove the new feature valid from a member perspective. If it does, we will be ready to roll out in our continuous pursuit of the better product for our members. The diagram below illustrates the details of this process.


An extreme example of this innovation cycle is what we called the Top10 Marathon. This was a focused, 10-week effort to quickly test dozens of algorithmic ideas related to improving our Top10 row. Think of it as a 2-month hackathon with metrics. Different teams and individuals were invited to contribute ideas and code in this effort. We rolled out 6 different ideas as A/B tests each week and kept track of the offline and online metrics. The winning results are already part of our production system.

 

Conclusion

The Netflix Prize abstracted the recommendation problem to a proxy question of predicting ratings. But member ratings are only one of the many data sources we have and rating predictions are only part of our solution. Over time we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service. More data availability enables better results. But in order to get those results, we need to have optimized approaches, appropriate metrics and rapid experimentation.

To excel at innovating personalization, it is insufficient to be methodical in our research; the space to explore is virtually infinite. At Netflix, we love choosing and watching movies and TV shows. We focus our research by translating this passion into strong intuitions about fruitful directions to pursue; under-utilized data sources, better feature representations, more appropriate models and metrics, and missed opportunities to personalize. We use data mining and other experimental approaches to incrementally inform our intuition, and so prioritize investment of effort. As with any scientific pursuit, there’s always a contribution from Lady Luck, but as the adage goes, luck favors the prepared mind. Finally, above all, we look to our members as the final judges of the quality of our recommendation approach, because this is all ultimately about increasing our members' enjoyment in their own Netflix experience. We are always looking for more people to join our team of "prepared minds". Make sure you take a look at our jobs page.


Asgard: Web-based Cloud Management and Deployment

By Joe Sondow, Engineering Tools

For the past several years Netflix developers have been using self-service tools to build and deploy hundreds of applications and services to the Amazon cloud. One of those tools is Asgard, a web interface for application deployments and cloud management.
Asgard is named for the home of the Norse god of thunder and lightning, because Asgard is where Netflix developers go to control the clouds. I’m happy to announce that Asgard has now been open sourced on github and is available for download and use by anyone. All you’ll need is an Amazon Web Services account. Like other open source Netflix projects, Asgard is released under the Apache License, Version 2.0. Please feel free to fork the project and make improvements to it.
Some of the information in this blog post is also published in the following presentations. Note that Asgard was originally named the Netflix Application Console, or NAC.

Visual Language for the Cloud

To help people identify various types of cloud entities, Asgard uses the Tango open source icon set, with a few additions. These icons help establish a visual language to help people understand what they are looking at as they navigate. Tango icons look familiar because they are also used by Jenkins, Ubuntu, Mediawiki, Filezilla, and Gimp. Here is a sampling of Asgard's cloud icons.

Cloud Model

The Netflix cloud model includes concepts that AWS does not support directly: Applications and Clusters.

Application

Below is a diagram of some of the Amazon objects required to run a single front-end application such as Netflix’s autocomplete service.
Here’s a quick summary of the relationships of these cloud objects.
  • An Auto Scaling Group (ASG) can attach zero or more Elastic Load Balancers (ELBs) to new instances.
  • An ELB can send user traffic to instances.
  • An ASG can launch and terminate instances.
  • For each instance launch, an ASG uses a Launch Configuration.
  • The Launch Configuration specifies which Amazon Machine Image (AMI) and which Security Groups to use when launching an instance.
  • The AMI contains all the bits that will be on each instance, including the operating system, common infrastructure such as Apache and Tomcat, and a specific version of a specific Application.
  • Security Groups can restrict the traffic sources and ports to the instances.
That’s a lot of stuff to keep track of for one application.
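To make the list above concrete, creating just the ASG piece with the AWS SDK for Java looks roughly like the sketch below. The resource names are made up, and the launch configuration (which references the AMI and security groups), the ELB and the account credentials are assumed to exist already.

import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;

public class CreateAsgExample {
    public static void main(String[] args) {
        AmazonAutoScalingClient autoScaling = new AmazonAutoScalingClient();
        // The launch configuration already points at the AMI and security groups.
        autoScaling.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                .withAutoScalingGroupName("autocomplete-v001")
                .withLaunchConfigurationName("autocomplete-v001-20120601")
                .withLoadBalancerNames("autocomplete-frontend")
                .withAvailabilityZones("us-east-1a", "us-east-1c")
                .withMinSize(3)
                .withMaxSize(3));
    }
}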
When there are large numbers of those cloud objects in a service-oriented architecture (like Netflix has), it’s important for a user to be able to find all the relevant objects for their particular application. Asgard uses an application registry in SimpleDB and naming conventions to associate multiple cloud objects with a single application. Each application has an owner and an email address to establish who is responsible for the existence and state of the application's associated cloud objects.
Asgard limits the set of permitted characters in the application name so that the names of other cloud objects can be parsed to determine their association with an application.
Here is a screenshot of Asgard showing a filtered subset of the applications running in our production account in the Amazon cloud in the us-east-1 region:
Screenshot of a detail screen for a single application, with links to related cloud objects:

Cluster

On top of the Auto Scaling Group construct supplied by Amazon, Asgard infers an object called a Cluster which contains one or more ASGs. The ASGs are associated by naming convention. When a new ASG is created within a cluster, an incremented version number is appended to the cluster's "base name" to form the name of the new ASG. The Cluster provides Asgard users with the ability to perform a deployment that can be rolled back quickly.
Example: During a deployment, cluster obiwan contains ASGs obiwan-v063 and obiwan-v064. Here is a screenshot of a cluster in mid-deployment.
The old ASG is “disabled” meaning it is not taking traffic but remains available in case a problem occurs with the new ASG. Traffic comes from ELBs and/or from Discovery, an internal Netflix service that is not yet open sourced.
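To make that naming convention concrete, here is a small illustrative Java sketch (not Asgard's implementation) of deriving the next ASG name from a cluster's existing ASGs:

import java.util.Collection;
import java.util.List;

public class ClusterNaming {

    /** Derives the next ASG name for a cluster, e.g. ["obiwan-v063", "obiwan-v064"] -> "obiwan-v065". */
    public static String nextAsgName(String clusterBaseName, Collection<String> existingAsgNames) {
        String prefix = clusterBaseName + "-v";
        int highest = -1;
        for (String name : existingAsgNames) {
            if (name.startsWith(prefix)) {
                try {
                    highest = Math.max(highest, Integer.parseInt(name.substring(prefix.length())));
                } catch (NumberFormatException ignored) {
                    // Not a versioned ASG of this cluster; skip it.
                }
            }
        }
        return String.format("%s-v%03d", clusterBaseName, highest + 1);
    }

    public static void main(String[] args) {
        System.out.println(nextAsgName("obiwan", List.of("obiwan-v063", "obiwan-v064"))); // obiwan-v065
    }
}
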

Deployment Methods

Fast Rollback

One of the primary features of Asgard is the ability to use the cluster screen shown above to deploy a new version of an application in a way that can be reversed at the first sign of trouble. This method requires more instances to be in use during deployment, but it can greatly reduce the duration of service outages caused by bad deployments.
This animated diagram shows a simplified process of using the Cluster interface to try out a deployment and roll it back quickly when there is a problem:
The animation illustrates the following deployment use case (a simplified code sketch of the same sequence follows the list):
  1. Create the new ASG obiwan-v064
  2. Enable traffic to obiwan-v064
  3. Disable traffic on obiwan-v063
  4. Monitor results and notice that things are going badly
  5. Re-enable traffic on obiwan-v063
  6. Disable traffic on obiwan-v064
  7. Analyze logs on bad servers to diagnose problems
  8. Delete obiwan-v064
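Here is that sequence sketched in Java against a hypothetical CloudActions interface; it illustrates the workflow only and is not Asgard's actual API:

public class FastRollbackDeployment {

    /** Hypothetical abstraction over the ASG/ELB/Discovery operations a deployment needs. */
    interface CloudActions {
        void createAsg(String asgName, String launchConfig, int desiredCapacity);
        void enableTraffic(String asgName);    // register with ELBs and/or Discovery
        void disableTraffic(String asgName);   // deregister, but keep the instances running
        boolean looksHealthy(String asgName);  // check monitoring after the traffic shift
        void deleteAsg(String asgName);
    }

    static void deploy(CloudActions cloud, String oldAsg, String newAsg, String launchConfig, int size) {
        cloud.createAsg(newAsg, launchConfig, size);   // 1. create obiwan-v064
        cloud.enableTraffic(newAsg);                   // 2. send traffic to the new ASG
        cloud.disableTraffic(oldAsg);                  // 3. stop sending traffic to the old ASG

        if (!cloud.looksHealthy(newAsg)) {             // 4. monitor results
            cloud.enableTraffic(oldAsg);               // 5. roll back: the old ASG takes traffic again
            cloud.disableTraffic(newAsg);              // 6. quarantine the bad ASG
            // 7. analyze logs on the bad instances before...
            cloud.deleteAsg(newAsg);                   // 8. ...deleting obiwan-v064
        } else {
            cloud.deleteAsg(oldAsg);                   // happy path: retire the old ASG
        }
    }
}
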

Rolling Push

Asgard also provides an alternative deployment system called a rolling push. This is similar to a conventional data center deployment of a cluster of application servers. Only one ASG is needed. Old instances get gracefully deleted and replaced by new instances, one or two at a time, until all the instances in the ASG have been replaced.
Rolling pushes are useful:
  1. If an ASG's instances are sharded so each instance has a distinct purpose that should not be duplicated by another instance.
  2. If the clustering mechanisms of the application (such as Cassandra) cannot support sudden increases in instance count for the cluster.
Downsides to a rolling push:
  1. Replacing instances in small batches can take a long time.
  2. Reversing a bad deployment can take a long time.

Task Automation

Several common tasks are built into Asgard to automate the deployment process. Here is an animation showing a time-compressed view of a 14-minute automated rolling push in action:

Auto Scaling

Netflix focuses on the ASG as the primary unit of deployment, so Asgard also provides a variety of graphical controls for modifying an ASG and setting up metrics-driven auto scaling when desired.
CloudWatch metrics can be selected from the default provided by Amazon such as CPUUtilization, or can be custom metrics published by your application using a library like Servo for Java.

Why not the AWS Management Console?

The AWS Management Console has its uses for someone with your Amazon account password who needs to configure something Asgard does not provide. However, for everyday large-scale operations, the AWS Management Console has not yet met the needs of the Netflix cloud usage model, so we built Asgard instead. Here are some of the reasons.
  • Hide the Amazon keys

    Netflix grants its employees a lot of freedom and responsibility, including the rights and duties of enhancing and repairing production systems. Most of those systems run in the Amazon cloud. Although we want to enable hundreds of engineers to manage their own cloud apps, we prefer not to give all of them the secret keys to access the company’s Amazon accounts directly. Providing an internal console allows us to grant Asgard users access to our Amazon accounts without telling too many employees the shared cloud passwords. This strategy also saves us from needing to assign and revoke hundreds of Identity and Access Management (IAM) cloud accounts for employees.
  • Auto Scaling Groups

    As of this writing the AWS Management Console lacks support for Auto Scaling Groups (ASGs). Netflix relies on ASGs as the basic unit of deployment and management for instances of our applications. One of our goals in open sourcing Asgard is to help other Amazon customers make greater use of Amazon’s sophisticated auto scaling features. ASGs are a big part of the Netflix formula to provide reliability, redundancy, cost savings, clustering, discoverability, ease of deployment, and the ability to roll back a bad deployment quickly.
  • Enforce Conventions

    Like any growing collection of things users are allowed to create, the cloud can easily become a confusing place full of expensive, unlabeled clutter. Part of the Netflix Cloud Architecture is the use of registered services associated with cloud objects by naming convention. Asgard enforces these naming conventions in order to keep the cloud a saner place that is possible to audit and clean up regularly as things get stale, messy, or forgotten.
  • Logging

    So far the AWS console does not expose a log of recent user actions on an account. This makes it difficult to determine whom to call when a problem starts, and what recent changes might relate to the problem. Lack of logging is also a non-starter for any sensitive subsystems that legally require auditability.
  • Integrate Systems

    Having our own console empowers us to decide when we want to add integration points with our other engineering systems such as Jenkins and our internal Discovery service.
  • Automate Workflow

    Multiple steps go into a safe, intelligent deployment process. By knowing certain use cases in advance Asgard can perform all the necessary steps for a deployment based on one form submission.
  • Simplify REST API

    For common operations that other systems need to perform, we can expose and publish our own REST API to do exactly what we want in a way that hides some of the complex steps from the user.

Costs

When using cloud services, it’s important to keep a lid on your costs. As of June 5, 2012, Amazon now provides a way to track your account’s charges frequently. This data is not exposed through Asgard as of this writing, but someone in your company should keep track of your cloud costs regularly. See http://aws.typepad.com/aws/2012/06/new-programmatic-access-to-aws-billing-data.html
Starting up Asgard does not initially cause you to incur any Amazon charges, because Amazon has a free tier for SimpleDB usage and no charges for creating Security Groups, Launch Configurations, or empty Auto Scaling Groups. However, as soon as you increase the size of an ASG above zero Amazon will begin charging you for instance usage, depending on your status for Amazon’s Free Usage Tier. Creating ELBs, RDS instances, and other cloud objects can also cause you to incur charges. Become familiar with the costs before creating too many things in the cloud, and remember to delete your experiments as soon as you no longer need them. Your Amazon costs are your own responsibility, so run your cloud operations wisely.

Feature Films

By extraordinary coincidence, Thor and Thor: Tales of Asgard are now available to watch on Netflix streaming.

Conclusion

Asgard has been one of the primary tools for application deployment and cloud management at Netflix for years. By releasing Asgard to the open source community we hope more people will find the Amazon cloud and Auto Scaling easier to work with, even at large scale like Netflix. More Asgard features will be released regularly, and we welcome participation by users on GitHub.
Follow the Netflix Tech Blog and the @NetflixOSS twitter feed for more open source components of the Netflix Cloud Platform.
If you're interested in working with us to solve more of these interesting problems, have a look at the Netflix jobs page to see if something might suit you. We're hiring!

Related Resources

Asgard

Netflix Cloud Platform

Amazon Web Services

Scalable Logging and Tracking

Scalable Logging

by Kedar Sadekar


At Netflix we work hard to improve personalized recommendations. We use a lot of data to make recommendations better. What may seem an arbitrary action -- scrolling up, down, left or right and how much -- actually provides us with valuable information. We work to get all the necessary data points and feedback to provide the best user experience.

To capture the large amount of data generated, we clearly need a dedicated, fast, scalable, highly available and asynchronous collection system that does not slow down the user experience.
In this post we discuss the decisions and considerations that went into building a service that accepts a few billion requests a day, processing and storing these requests for later use and analysis by various systems within Netflix.

Considerations

We did not want this service to disrupt the user experience, so the main objective was to keep latency as low as possible. It also needed to scale to handle billions of requests a day. The data sent to and processed by this service is noncritical. That was an important factor in our design: we made a conscious choice to accept dropping data (user events) rather than providing a sub-optimal client experience. From the client side, the call is fire-and-forget, which essentially means the client should not care what the end result of the call was (success/failure).

Data Size

The average size of the request and the logged data is around 16 KB (range: 800 bytes ~ 130 KB) whereas the response average is pretty consistent at around 512 bytes. Here is an example of the data (fields) that we capture: video, device, page, timestamp.

Latency

The service needs to handle a billion-plus requests a day, and peak traffic can be 3 to 6 times the average when measured in requests per second (RPS). To achieve our goal of low millisecond latency for this service, here are some of the practices we adopted:

Holding on to the request is expensive

This service is developed using Java and deployed on a standard Tomcat container. To achieve high throughput, we want to free up Tomcat threads as soon as we can, so we do not hold on to the request any longer than required. The methodology is simple: grab whatever data we need from the HTTP request object, push it onto a thread pool for later processing, and flush the response to the client immediately. Holding on to the request any longer translates to lower throughput per node in the cluster, which means having to scale out further, and scaling horizontally beyond a point is inefficient and cost-ineffective.
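Here is a minimal sketch of that hand-off pattern, assuming a plain servlet and a hypothetical EventProcessor; it is illustrative only, not the actual Netflix service code:

import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LogEventServlet extends HttpServlet {

    // Bounded pool with a discard policy: if it saturates, we drop events rather than block Tomcat threads.
    private final ExecutorService workers = new ThreadPoolExecutor(
            16, 32, 60, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(10_000),
            new ThreadPoolExecutor.DiscardPolicy());

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Copy only what we need out of the request object...
        final String device = req.getParameter("device");
        final String page = req.getParameter("page");
        final String payload = req.getParameter("events");

        // ...hand the expensive work to a background pool...
        workers.submit(() -> EventProcessor.process(device, page, payload));

        // ...and release the Tomcat thread immediately.
        resp.setStatus(HttpServletResponse.SC_ACCEPTED); // 202: fire-and-forget from the client's view
    }

    /** Hypothetical stand-in for the real processing pipeline. */
    static class EventProcessor {
        static void process(String device, String page, String payload) {
            // parse, enrich and queue the event for the log collectors
        }
    }
}
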

Fail fast / first

Return as quickly as you can: identify your failure cases first, before doing any unnecessary processing, and return as soon as you know there is no point in moving forward.
An example: if the logged record must include data from the cookie, crack the cookie first, before dealing with any other request parameters. If the cookie does not have the required data, return immediately; don't bother looking at any other data the request body contains.
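A small illustrative sketch of that ordering, using a hypothetical cookie field name:

import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;

public class FailFast {

    /** Returns the tracking id from the cookie, or null; a hypothetical helper for illustration. */
    static String trackingIdFrom(HttpServletRequest req) {
        if (req.getCookies() == null) {
            return null;
        }
        for (Cookie c : req.getCookies()) {
            if ("trackingId".equals(c.getName())) {   // hypothetical cookie name
                return c.getValue();
            }
        }
        return null;
    }

    static boolean shouldProcess(HttpServletRequest req) {
        // Cheapest check first: if the cookie data is missing, stop before touching the rest of the request.
        return trackingIdFrom(req) != null;
    }
}
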

HTTP codes  

We capture metrics on all the 2xx / 4xx / 5xx responses that we serve. Some callers don't care about failures; in those cases we simply return an HTTP 202 (Accepted) response. Having these metrics in place helps you tune your code, and if the calling service does not care, why bother returning a 4xx response? We also have alerts that trigger based on the percentage of each class of HTTP response code.

Dependencies can and will slow down sometimes

We did an exercise to identify every dependency (other Netflix services / jars) of this service that makes calls across the wire. We have learned that however reliable and robust the dependencies are, there will be network glitches and service latency issues at some point, and we do not want the logging service to be bogged down by such issues.
We guard any such service call by wrapping it in a Java Future with an appropriate timeout. The most aggressive timeouts are reserved for calls in the hot path (before the response is flushed). Adding plenty of metrics helped us understand which services were timing out too often or were the slowest.
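A minimal sketch of guarding a cross-the-wire call with a Java Future and a timeout; the lookup, the 50 ms budget and the fallback value are all illustrative assumptions:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class GuardedDependencyCall {

    private static final ExecutorService dependencyPool = Executors.newFixedThreadPool(8);

    /** Calls a remote lookup with an aggressive timeout; falls back to a default on any failure. */
    static String countryForUser(String userId, Callable<String> remoteLookup) {
        Future<String> future = dependencyPool.submit(remoteLookup);
        try {
            return future.get(50, TimeUnit.MILLISECONDS);  // hot-path budget (illustrative number)
        } catch (TimeoutException e) {
            future.cancel(true);                           // do not let the slow call hold a thread
            // a metric would be incremented here so slow dependencies show up in dashboards
            return "UNKNOWN";                              // fallback keeps the request moving
        } catch (Exception e) {
            return "UNKNOWN";
        }
    }
}
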

Process Later

Once we have all the data we need, we put it onto a queue for asynchronous processing by an executor pool.
The following diagram illustrates what has been described above.

 

Garbage Collection


For a service written entirely in Java, an important deployment consideration is pause times during garbage collection. The nature of this service is an extremely large volume of very short-lived objects. We experimented with GC tuning variables to achieve the best throughput. As part of these experiments, we tried various combinations of the parallel generational collector and the CMS (Concurrent Mark Sweep) collector. We set up canaries taking peak production traffic for at least a couple of days, with different young-generation-to-heap ratios.
Each time we had a winner, we pitted the CMS canary against the best canary with the parallel collector. We did this 2-3 times until we were sure we had a winner.
Each winner was analyzed by capturing the GC logs and mining them for timings and counts of new-generation (ParNew) collections, full GCs and CMS failures (if any). We learned that running canaries is the only way of knowing for sure; don't be in a hurry to pick a winner.
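For illustration, the kinds of JVM flag combinations such canaries might compare look like the following (representative HotSpot options, not our exact settings; logging-service.jar is a placeholder):

# Canary A: throughput (parallel) collectors with a larger young generation
java -server -Xms4g -Xmx4g -Xmn1536m -XX:+UseParallelGC -XX:+UseParallelOldGC -verbose:gc -XX:+PrintGCDetails -jar logging-service.jar

# Canary B: ParNew for the young generation, CMS for the old generation
java -server -Xms4g -Xmx4g -Xmn1024m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -verbose:gc -XX:+PrintGCDetails -jar logging-service.jar
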

Auto Scaling

Measure, Measure

Since traffic (RPS) is unpredictable, at Netflix we heavily leverage auto-scaling policies. There are different metrics one could use to auto-scale a cluster, the most common being CPU load and RPS. We chose to scale primarily on RPS; CPU load is used to trigger alerts at both the instance and the cluster level. Many of the metrics we gather are powered by our own Servo code (available on GitHub).
We collected these metrics over a few days, including peak weekend traffic, and then applied the policies that enable us to scale effectively in the cloud. [See reference on auto-scaling]
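As a simple stand-in for what a metrics library such as Servo provides, here is an illustrative sketch of tracking RPS so that it can be published for use in a scaling policy:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class RequestRateTracker {

    private final AtomicLong requests = new AtomicLong();
    private final ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();

    public RequestRateTracker() {
        // Every minute, report the average RPS for the interval and reset the counter.
        reporter.scheduleAtFixedRate(() -> {
            long count = requests.getAndSet(0);
            publish("RequestsPerSecond", count / 60.0);
        }, 1, 1, TimeUnit.MINUTES);
    }

    /** Call once per incoming request, e.g. from a servlet filter. */
    public void recordRequest() {
        requests.incrementAndGet();
    }

    private void publish(String metricName, double value) {
        // Stand-in for a real metrics pipeline (e.g. Servo publishing to CloudWatch).
        System.out.printf("%s=%.1f%n", metricName, value);
    }
}
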

Have Knobs

All these throughput measurements were done in steps. We had knobs in place that allowed us to slowly ramp up traffic, observe the system's behavior and make any necessary changes, gaining confidence in what the system could handle.
Here is a graph showing the RPS followed by a graph showing the average latency metrics (in milliseconds) over the same period.





Persistence

The real magic of such voluminous data collection and aggregation is actually done by our internal log collectors. Individual machines have agents that send the logs to collectors and finally to the data sink (Hive for example).



Common Infrastructure / Multiple end-points
As different teams within Netflix churn out different features and algorithms, the need to measure the efficacy and success of those features never diminishes. However, those teams would rather focus on their core competencies than set up a logging / tracking infrastructure that caters to their individual needs.
It made perfect sense for those teams to direct their traffic to the logging service. Since the data required by each team is disparate, each of these teams’ needs is considered as a new end-point on the logging service.
Supporting a new client is simple, with the main decision being whether the traffic warrants an independent cluster or can be co-deployed with a cluster that supports other end-points.

When a single service exposes multiple end-points, each receiving hundreds of millions of requests a day, we needed to decide between scaling a single cluster horizontally forever or breaking the service into multiple clusters by functionality. There are pros and cons either way. Here are a few:

Pros of single cluster
-       Single deployment
-       One place to manage / track
Pros of multiple deployments
-       Failure in one end-point does not affect another, especially in internal dependencies
-       Ability to independently scale up/down volume
-       Easier to debug issues

Conclusion

As the traffic has ramped up, we have been able to scale very comfortably so far, learning lessons as we went along.
The data is being analyzed in multiple ways by our algorithm teams. For example: which row types (Top 10, most recently watched, etc.) did most plays emanate from? How did that vary by country and device? How far did users scroll left / right across devices, and do users ever go beyond a certain point? These and many other data points are being examined to improve our algorithms and provide users with a better viewing experience.

References

Servo : 

Auto-scaling: 

Join Us

Like what you see and want to work on bleeding edge performance and scale?
by Kedar Sadekar, Senior Software Engineer, Product Infrastructure Team




Lessons Netflix Learned From The AWS Storm

by Greg Orzell & Ariel Tseitlin

Overview

On Friday, June 29th, we experienced one of the most significant outages in over a year. It started at about 8 PM Pacific Time and lasted for about three hours, affecting Netflix members in the Americas. We’ve written frequently about our resiliency efforts and our experience with the Amazon cloud. In the past, we’ve been able to withstand Amazon Web Services (AWS) availability zone outages with minimal impact. We wanted to take this opportunity to share our findings about why this particular zone outage had such an impact.

For background, you can read about Amazon’s root-cause analysis of their outage here: http://aws.amazon.com/message/67457/.  The short version is that one of Amazon’s Availability Zones (AZs) failed on Friday evening due to a power outage that was caused by a severe storm.  Power was restored 20 minutes later. However, the Elastic Load Balancing (ELB) service suffered from capacity problems and an API backlog, which slowed recovery.

Our own root-cause analysis uncovered some interesting findings, including an edge-case in our internal mid-tier load-balancing service. This caused unhealthy instances to fail to deregister from the load-balancer which black-holed a large amount of traffic into the unavailable zone. In addition, the network calls to the instances in the unavailable zone were hanging, rather than returning no route to host.

As part of this outage we have identified a number of things that both we and Amazon can do better, and we are working with them on improvements.

Middle-tier Load Balancing

In our middle-tier load-balancing, we had a cascading failure that was caused by a feature we had implemented to account for other types of failures. The service that keeps track of the state of the world has a fail-safe mode in which it will not remove unhealthy instances if a significant portion of them appear to fail simultaneously. This was done to deal with network partition events and was intended to be a short-term freeze until someone could investigate the large-scale issue. Unfortunately, getting out of this state proved both cumbersome and time-consuming, causing services to continue trying to use servers that were no longer alive due to the power outage.

Gridlock

Clients trying to connect to servers that were no longer available led to a second-order issue. All of the client threads were taken up by attempted connections, leaving very few threads to process requests. This essentially caused gridlock inside most of our services as they tried to traverse our middle tier. We are working to make our systems resilient to these kinds of edge cases, and we continue to investigate why these connections hung during connect rather than quickly determining that there was no route to the unavailable hosts and failing fast.
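One common mitigation for this class of problem is to bound how long a client will wait to establish a connection instead of relying on operating-system defaults; a minimal illustrative sketch (not our middle-tier client) is below:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class BoundedConnect {

    /** Attempts a connection with explicit connect and read timeouts so a dead or black-holed host fails fast. */
    static Socket connect(String host, int port) throws IOException {
        Socket socket = new Socket();
        socket.setSoTimeout(1000);                              // read timeout: 1s (illustrative)
        socket.connect(new InetSocketAddress(host, port), 250); // connect timeout: 250ms (illustrative)
        return socket;
    }
}
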

Summary

Netflix made the decision to move from the data center to the cloud several years ago [1].  While it’s easy and common to blame the cloud for outages because it’s outside of our control, we found that our overall availability over the past several years has steadily improved. When we dig into the root-causes of our biggest outages, we find that we can typically put in resiliency patterns to mitigate service disruption.

There were aspects of our resiliency architecture that worked well:
  • Regional isolation contained the problem to users being served out of the US-EAST region.  Our European members were unaffected.
  • Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability.
  • Chaos Gorilla, the Simian Army member tasked with simulating the loss of an availability zone, was built for exactly this purpose.  This outage highlighted the need for additional tools and use cases for both Chaos Gorilla and other parts of the Simian Army.
The state of the cloud will continue to mature and improve over time.  We’re working closely with Amazon on ways that they can improve their systems, focusing our efforts on eliminating single points of failure that can cause region-wide outages and isolating the failures of individual zones.

We take our availability very seriously and strive to provide an uninterrupted service to all our members. We’re still bullish on the cloud and continue to work hard to insulate our members from service disruptions in our infrastructure.

We’re continuing to build up our Cloud Operations and Reliability Engineering team, which works on exactly the types of problems identified above, as well as each service team to deal with resiliency.  Take a look at jobs.netflix.com for more details and apply directly or contact @atseitlin if you’re interested.


[1] http://techblog.netflix.com/2010/12/four-reasons-we-choose-amazons-cloud-as.html





Embracing the Differences : Inside the Netflix API Redesign

As I discussed in my recent blog post on ProgrammableWeb.com, Netflix has found substantial limitations in the traditional one-size-fits-all (OSFA) REST API approach. As a result, we have moved to a new, fully customizable API. The basis for our decision is that Netflix's streaming service is available on more than 800 different device types, almost all of which receive their content from our private APIs. In our experience, we have realized that supporting these myriad device types with an OSFA API, while successful, is not optimal for the API team, the UI teams or Netflix streaming customers. And given that the key audiences for the API are a small group of known developers to which the API team is very close (i.e., mostly internal Netflix UI development teams), we have evolved our API into a platform for API development. Supporting this platform are a few key philosophies, each of which is instrumental in the design of our new system. These philosophies are as follows:

  • Embrace the Differences of the Devices
  • Separate Content Gathering from Content Formatting/Delivery
  • Redefine the Border Between "Client" and "Server"
  • Distribute Innovation

I will go into more detail below about each of these, including our implementation and what the benefits (and potential detriments) are of this approach. However, each philosophy reflects our top-level goal: to provide whatever is best for the Netflix customer. If we can improve the interaction between the API and our UIs, we have a better chance of making more of our customers happier.

Now, the philosophies…

Embrace the Differences of the Devices

The key driver for this redesigned API is the fact that there are a range of differences across the 800+ device types that we support. Most APIs (including the REST API that Netflix has been using since 2008) treat these devices the same, in a generic way, to make the server-side implementations more efficient. And there is good reason for this approach. Providing an OSFA API allows the API team to maintain a solid contract with a wide range of API consumers because the API team is setting the rules for everyone to follow.

While effective, the problem with the OSFA approach is that its emphasis is to make it convenient for the API provider, not the API consumer. Accordingly, OSFA is ignoring the differences of these devices; the differences that allow us to more optimally take advantage of the rich features offered on each. To give you an idea of these differences, devices may differ on:

  • Memory capacity or processing power, potentially affecting how much content a device can manage at a given time
  • Requirements for distinct markup formats (and broader device proliferation increases the likelihood of this)
  • Document models: some devices may perform better with flatter models, others with more hierarchical ones
  • Screen real estate, which may impact the content elements that are needed
  • Document delivery: some devices perform better with bits streamed across HTTP rather than delivered as a complete document
  • User interactions, which could influence the metadata fields, delivery method, interaction model, etc.

Our new model is designed to cut against the OSFA paradigm and embrace the differences across devices while supporting those differences equally. To achieve this, our API development platform allows each UI team to create customized endpoints. So the request/response model can be optimized for each team’s UIs to account for unique or divergent device requirements. To support the variability in our request/response model, we need a different kind of architecture, which takes us to the next philosophy...

Separate Content Gathering from Content Formatting/Delivery

In many OSFA implementations, the API is the engine that retrieves the content from the source(s), prepares that payload, and then ultimately delivers it. Historically, this implementation is also how the Netflix REST API has operated, which is loosely represented by the following image.

Diagram showing Netflix UIs interacting with the Netflix REST API

The above diagram shows a rainbow of colors roughly representing some of the different requests needed for the PS3, as an example, to start the Netflix experience. Other UIs will have a similar set of interactions against the OSFA REST API given that they are all required by the API to adhere to roughly the same set of rules. Inside the REST API is the engine that performs the gathering, preparation and delivery of the content (indifferent to which UI made the request).

Our new API has departed from the OSFA API model towards one that enables fine-grained customizations without compromising overall system manageability. To achieve this model, our new architecture clearly separates the operations of content gathering from content formatting and delivery. The following diagram represents this modified architecture:

Diagram showing Netflix UIs interacting with the new optimized Netflix non-REST API

In this new model, the UIs make a single request to a custom endpoint that is designed to specifically handle that request. Behind the endpoint is a handler that parses the request and calls the Java API, which gathers the content by calling back to a range of dependent services. We will discuss in later posts how we do this, particularly in how we parse the requests, trigger calls to dependencies, handle concurrency, support fallbacks, as well as other techniques we use to ensure optimized and accurate gathering of the content. For now, though, I will just say that the content gathering from the Java API is generic and independent of destination, just like the OSFA approach.

After the content has been gathered, however, it is handed off to the formatting and delivery engines which sit on top of the Java API on the server. The diagram represents this layer by showing an array of different devices resting on top of the Java API, each of which corresponds to the custom endpoints for a given UI and/or set of devices. The custom endpoints, as mentioned earlier, support optimized request/response handling for that device, which takes us to the next philosophy...

Redefine the Border Between "Client" and "Server"

The traditional definition of "client code" is all code that lives on a given device or UI. "Server code" is typically defined as the code that resides on the server. The divide between the two is the network border. This is often the case for REST APIs and that border is where the contract between the API provider and API consumer is engaged, as was the case for Netflix’s REST API, as shown below:

Diagram showing the traditional border between client and server code in REST APIs

In our new approach, we are pushing this border back to the server, and with it goes a substantial portion of the UI-specific content processing. All of the code on the device is still considered client code, but some client code now resides on the server. In essence, the client code on the device makes a network call back to a dedicated client adapter that resides on the server behind the custom endpoint. Once back on the server, the adapter (currently written in Groovy) explodes that request out to a series of server-side calls that get the corresponding content (in some cases, roughly the same rainbow of requests that would be handled across HTTP in our old REST API). At that point, the Java APIs perform their content gathering functions and deliver the requested content back to the adapter. Once the adapter has some or all of its content, the adapter processes it for delivery, which includes pruning out unwanted fields, error handling and retries, formatting the response, and delivering the document header and body. All of this processing is custom to the specific UI. This new definition of client/server is represented in the following diagram:

Diagram showing the modified border between client and server code in the optimized Netflix non-REST API

There are two major aspects to this change. First, it allows for more efficient interactions between the device and the server, since most calls that otherwise would go across the network can be handled on the server. Network calls are the most expensive part of the transaction, so reducing the number of network requests improves performance, in some cases by several seconds. The second aspect leads us to the final (and perhaps most important) philosophy of this approach: distributing the work of building out the optimized adapters.
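To make the flow concrete, here is a hedged Java sketch of a device-specific adapter. The names (ContentService, Ps3HomeScreenAdapter) and the timeouts are illustrative assumptions rather than the actual Netflix adapter code, but the shape follows the description above: fan out to the generic content API, apply fallbacks, then format a single device-tailored payload.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class Ps3HomeScreenAdapter {

    /** Hypothetical facade over the generic, device-agnostic Java content API. */
    interface ContentService {
        Map<String, Object> userQueue(String userId);
        Map<String, Object> recommendations(String userId);
        Map<String, Object> recentlyWatched(String userId);
    }

    private final ContentService api;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    Ps3HomeScreenAdapter(ContentService api) {
        this.api = api;
    }

    /** One device request fans out into several server-side calls, then one tailored payload comes back. */
    Map<String, Object> homeScreen(String userId) {
        Future<Map<String, Object>> queue = pool.submit(() -> api.userQueue(userId));
        Future<Map<String, Object>> recs = pool.submit(() -> api.recommendations(userId));
        Future<Map<String, Object>> recent = pool.submit(() -> api.recentlyWatched(userId));

        Map<String, Object> response = new HashMap<>();
        response.put("queue", valueOrFallback(queue));
        response.put("recommendations", valueOrFallback(recs));
        response.put("recentlyWatched", valueOrFallback(recent));
        // Device-specific formatting would happen here: prune fields the PS3 UI does not need,
        // flatten or nest the document to match its preferred model, and so on.
        return response;
    }

    private Map<String, Object> valueOrFallback(Future<Map<String, Object>> future) {
        try {
            return future.get(200, TimeUnit.MILLISECONDS);   // illustrative timeout
        } catch (Exception e) {
            return Map.of("fallback", true);                 // degraded-but-usable section of the screen
        }
    }
}
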

Distribute Innovation

One expected critique of this approach is that as we add more devices and build more UIs for A/B and multivariate tests, there will undoubtedly be myriad adapters needed to support all of these distinct request profiles. How can we innovate rapidly and support such a diverse (and growing) set of interactions? It is critical for us to support the custom adapters, but it is equally important to maintain a high rate of innovation across these UIs and devices.
Example of how this new system works:
  1. A device, such as the PS3, makes a single request across the network to load the home screen (This code is written and supported by the PS3 UI team)
  2. A Groovy adapter receives and parses the PS3 request (PS3 UI team)
  3. The adapter explodes that one request into many calls to the Java API (PS3 UI team)
  4. Each Java API calls back to a dependent service, concurrently when appropriate, to gather the content needed for that sub-request (API team)
  5. In the Java API, if a dependent service is unavailable or returns a 4xx or 5xx, the Java API returns a fallback and/or an error code to the adapter (API team)
  6. Successful Java API transactions then return the content back to the adapter when each thread has completed (API team)
  7. The adapter can handle the responses from each thread progressively or all together, depending on how the UI team wants to handle it (PS3 UI team)
  8. The adapter then manipulates the content, retrieving the wanted (and pruning out the unwanted) elements, handling errors, etc. (PS3 UI team)
  9. The adapter formats the response in preparation for delivery back across the network to the PS3, which includes everything needed for the PS3 home screen in the single payload (PS3 UI team)
  10. The adapter finally handles the delivery of the payload across the network (PS3 UI team)
  11. The device will then parse this optimized response and populate the UI (PS3 UI team)

As described above, pushing some of the client code back to the servers and providing custom endpoints gives us the opportunity to distribute the API development to the UI teams. We are able to do this because the consumers of this private API are the Netflix UI and device teams. Given that the UI teams can create and modify their own adapter code (potentially without any intervention or involvement from the API team), they can be much more nimble in their development. In other words, as long as the content is available in the Java API, the UI teams can change the code that lives on the device to support the user experience and at the same time change the adapter code to deliver the payload needed for that experience. They are no longer bound by server teams dictating the rules and/or being a bottleneck for their development. API innovation is now in the hands of the UI teams! Moreover, because these adapters are isolated from each other, this approach also diminishes the risk of harming other device implementations with tactical changes in their device-specific APIs.

Of course, one drawback to this is that UI teams are often more skilled in technologies like HTML5, CSS3 and JavaScript. In this system, they now need to learn server-side technologies and techniques. So far, however, this has been a relatively small issue, especially since our engineering culture is to hire very strong, senior-level engineers who are adaptable, curious and passionate about learning and implementing these kinds of solutions. Another concern is that because the UI teams are implementing server-side adapters, they have the potential to bring down the servers through infinite loops or other resource-intensive processes. To offset this, we are working on scrubbing engines that will hopefully minimize the likelihood of such mistakes. That said, in the OSFA world, code on the device can just as easily DDoS the server; it is just potentially a bigger problem when that code runs on the server.



We are still in the early stages of this new system. Some of our devices have fully migrated over to it, others are split between it and the REST API, and others are just getting their feet wet. In upcoming posts, we will share more about the deeper technical aspects of the system, including the way we handle concurrency, how we manage the adapters, the interaction between the adapters and the Java API, our Groovy implementation, error handling, etc. We will also continue to share the evolution of this system as we learn more about it.

In the meantime, if you are interested in building high-scale, cloud-based solutions such as this one, we are hiring!

Daniel Jacobson (@daniel_jacobson)
Director of Engineering – Netflix API

Open Source at Netflix

By Ruslan Meshenberg

At Netflix we use a wide range of Open Source technologies.  In the recent months, we also released many of our internally developed components and libraries, starting with Curator for Zookeeper, and most recently with Asgard.

We started down this path by becoming a big user of Apache-licensed open source software. When we picked Apache Cassandra as our data storage solution, we started to contribute fixes and extensions to optimize Cassandra's capabilities on AWS. This led us to see the benefits of releasing our own projects, and we created a central Netflix account at netflix.github.com as a home for them.

There are many reasons why we’re opening up much of our software.  To highlight some of them:

  • We have benefited from many other people contributing to open source, so we are paying back in kind. 
  • Netflix was an early cloud adopter, moving all of our streaming services to run on top of AWS infrastructure.  We paid the pioneer tax – by encountering and working through many issues, corner cases and limitations.  We’ve captured the patterns that work in our platform components and automation tools. We benefit from the scale effects of other AWS users adopting similar patterns, and will continue working with the community to develop the ecosystem.
  • External community contributions - by opening up we enable the larger developer community to: review, comment,  add test cases, bug fixes, ports and functional contributions to our components, benefiting everyone.
  • Improved code and documentation quality – we’ve observed that the peer pressure from “Social Coding” has driven engineers to make sure code is clean and well structured, and that documentation is useful and up to date.  What we’ve learned is that a component may be “good enough for running in production, but not good enough for GitHub”.
  • Durability – we think any code will fare better over time if it’s actively developed by an open community and used widely, versus maintained by a small number of engineers for a single workload.

Brief overview of components we open sourced so far:

For Zookeeper:

  • Curator - Client wrapper and rich Zookeeper framework.
  • Exhibitor - Co-process for instance monitoring, backup/recovery, cleanup and visualization.

For Cassandra:

  • Astyanax - High level, simple object oriented client for Cassandra.
  • Priam - Co-process for backup/recovery, Token Management and centralized Configuration management for Cassandra.
  • Jmeter Plugin for Cassandra - automation for running Cassandra tests.

Netflix platform and tools:

  • Autoscaling scripts - Tools and documentation about using Auto Scaling part of AWS services.
  • Archaius - library for managing dynamic configuration properties.
  • Asgard - Web interface for application deployment and cloud management in AWS.


So far we’ve seen great community response and feedback on our Open Source efforts.  Many companies are already using components listed above, and many others are evaluating and integrating them into their own software stack.  We hope you find what we’ve opened so far useful.  Many great libraries and components are coming soon, stay tuned!  You can follow @NetflixOSS for news and updates.  You can let us know whether you're using our tools and send any feedback via one of our Mailing Lists.

If you’re interested in contributing to these and other great technologies at Netflix, check out jobs.netflix.com