
Announcing Pytheas

Today, we are excited to bring you Pytheas: a web resource and rich UI framework. This piece of software is heavily used at Netflix to build quick prototypes and web applications that explore and visualize large data sets.

Pytheas integrates the Guice and Jersey frameworks to wire REST web-service endpoints together with dynamic UI controls in a web application. The framework is designed to support the most common web UI components needed to build data exploration / dashboard style applications. It not only serves as a quick prototyping tool, but also acts as a foundation for integrating multiple data sources in a single application.




UI components bundle


The UI library bundled with Pytheas is based on a number of open source JavaScript frameworks such as Bootstrap, jQuery UI, DataTables, and D3. It also contains a number of jQuery plugins that we wrote to support specific use cases that we encountered in building Netflix internal applications and dashboards. Some of the plugins include support for Ajax data-driven selection boxes with a dynamic filter control, pop-over dialog box form templates, inline portlets, breadcrumbs, a loading spinner, etc.





Modular Design


An application based on the Pytheas framework consists of one or more Pytheas modules. Modules are loosely coupled from one another: each module is responsible for supplying its own data resources and can even provide its own rendering mechanism. Each data resource is a Jersey REST endpoint owned by the module.

By default Pytheas uses FreeMarker as the rendering template engine for each resource. The framework provides a library of reusable FreeMarker macros that can be embedded in a page to render commonly used UI components. Each Pytheas module gets access to all the common page building blocks such as page layout containers, header, footer, navbar, etc., which are embedded by the framework.

Although Pytheas provides FreeMarker as the default template engine, the framework allows each module to plug in its own template engine; the module just needs to supply a corresponding Jersey Provider.
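
To make the module idea concrete, here is a minimal sketch of a plain Jersey 1.x resource that renders a FreeMarker-backed view. This is generic Jersey code, not the Pytheas API itself; the resource path, template name, and model fields are hypothetical, and the wiring Pytheas performs (Guice modules, bundled macros, page chrome) is omitted.

import com.sun.jersey.api.view.Viewable;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import java.util.HashMap;
import java.util.Map;

// Hypothetical data-exploration endpoint: one REST resource owned by a module,
// rendered through a FreeMarker template (the default Pytheas template engine).
@Path("/explorer/jobs")
public class JobExplorerResource {

    @GET
    @Produces(MediaType.TEXT_HTML)
    public Viewable showJobs() {
        Map<String, Object> model = new HashMap<String, Object>();
        model.put("title", "Job Explorer");           // used by the page template
        model.put("dataEndpoint", "/REST/jobs/list");  // e.g. the source for a DataTables grid
        // "/jobs/index" would resolve to a FreeMarker template packaged with the module.
        return new Viewable("/jobs/index", model);
    }
}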







Getting Started


The Pytheas project contains a simple helloworld application that serves as a template for building new applications using the framework. Please refer to the project's instructions on how to run the helloworld application from the command line.




Garbage Collection Visualization

What is garbage collection visualization?

By Brian Moore

In short, “garbage collection visualization” (hereafter shortened to “gcviz”) is the process of turning gc.log[1] into x/y time-series scatter plots (pictures). That is, turning garbage collector (GC) logging output into two types of charts, GC event charts:



and heap size charts:
A GC event occurs when one of the JVM-configured collectors operates on the heap, and a heap size chart shows the size of the heap after a GC event. Both of these types of information are present in gc.log[1].
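
As a rough illustration of the parsing step (this is not gcviz's actual parser, and the regular expression only handles one simple line shape), a stop-the-world entry can be reduced to a (timestamp, seconds-since-boot, pause-duration) triple:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only (not gcviz's actual parser): extract the date stamp, the
// seconds-since-JVM-boot value, and the first pause duration from a gc.log line
// produced with -XX:+PrintGCDetails and -XX:+PrintGCDateStamps.
public class GcLineSketch {

    private static final Pattern GC_LINE =
            Pattern.compile("(\\S+): ([0-9.]+): \\[GC.*?, ([0-9.]+) secs\\]");

    public static void main(String[] args) {
        String line = "2013-01-01T19:28:47.229+0000: 2656246.456: "
                + "[GC [1 CMS-initial-mark: 5440145K(18432000K)] 5583733K(29184000K), "
                + "0.1454260 secs] [Times: user=0.15 sys=0.00, real=0.15 secs]";
        Matcher m = GC_LINE.matcher(line);
        if (m.find()) {
            System.out.printf("time=%s bootSeconds=%s pauseSeconds=%s%n",
                    m.group(1), m.group(2), m.group(3));
        }
    }
}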

Setting the context for gcviz

Earlier, Shobana wrote about the metadata that Video Metadata Services (VMS) maintains in-memory for clients to access along extremely low-latency (microseconds), high-volume (10^11 daily requests), large-data (gigabytes) code paths. Being able to serve-up data in this setting requires a deep understanding of Java garbage collection.


Netflix is (mostly) a Java shop. Netflix deploys Java applications into Apache Tomcat, an application server (itself written in Java). Tomcat, the Netflix application, and all of the libraries that are dependencies of the Netflix application allocate from the same heap. Netflix Java applications are typically long-running (weeks/months) and have large heaps (typically tens of gigabytes).

At this scale, the overhead required to manage the heap is significant, and garbage collection pauses that last more than 500 milliseconds will almost certainly interfere with network activity.

Any GC event that “stops the world” (STW) will pause Tomcat, the application, and the libraries the application needs to run. New inbound connections cannot be established (the Tomcat accept thread is blocked) and I/O on outbound connections will stall until the GC completes (each Java thread in a STW event is at a safepoint and unable to accomplish meaningful work). It is therefore important to ensure that any required GC pauses are kept as small as possible to ensure that an application remains available. GC pauses are often seen as the cause of an issue when in fact they are a side-effect of the allocation behavior of the combination of Tomcat, the application and the libraries the application uses.

Before gcviz was written, Netflix determined the influence (or absence) of GC events in outages via several methods:
  • hand-crafting excel spreadsheets
  • using PrintGCStats/PrintGCFixup/GCHisto
  • visually skimming gc.log, looking for “long” stop-the-world events
While these methods were effective, they were time-consuming and difficult for many Netflix developers. Inside Netflix, gcviz has been incorporated into Netflix’s base AMI and integrated into Asgard, making visual GC event analysis of any application trivial (a click of a button) for all Netflix developers.

Why is gcviz important?
gcviz is important for several reasons:
  • Convenient: The developer loads the event and heap chart pages directly into their browser by clicking on a link in Asgard. gcviz does all the processing behind the scenes. This convenience allows us to quickly reinforce or reject the oft-repeated claim: bad application behavior is "because of long GC pauses".
  • Visual: Images can convey time-series information more quickly and densely than text.
  • Causation: gcviz can help application developers establish or refute a correlation between time-of-day-based observed application degradation and JVM-wide GC behavior.
  • Clarity: The semantics of each of the different GC event types (concurrent-mark, ParNew) can be made implicit in the image instead of being given equal weight in the text of gc.log. This is useful because, for example, a long-running concurrent-sweep is usually no cause for alarm, since it runs concurrently with the application program.
  • Iterative: gcviz allows for quick interpretation of experimental results. What effect does modifying a particular setting or algorithm have on the heap usage or GC subsystem? gcviz allows one to quickly and visually understand any impact a GC-related change has made. gcviz is now being effectively used in canary deployments within Netflix where GC behavior is compared to an existing baseline.

Prior (and contemporary) Art
Netflix is not the first organization that has seen the benefits of visualizing garbage collection data. Other projects involved in the visualization or interpretation of garbage collection data include the following:

Why develop another solution?
While the tools mentioned above work well under their own conditions, Netflix had other requirements and constraints to consider and realized that none of the existing tools met all of Netflix’s needs. The following requirements were considered and followed during the design and implementation of gcviz. gcviz needs to:
  • Run outside the context of the application under analysis, but on each instance on which the application runs.
  • Use gc.log as the source-of-truth
    • historical analysis (no need to attach to a running vm)
    • filesystem/logs can be read even when HTTP port is busy and/or application threads are blocked
  • Read gc logs with the JVM options that Netflix commonly (but not universally) uses:
    • -verbose:gc
    • -verbose:sizes
    • -Xloggc:/apps/tomcat/logs/gc.log
    • -XX:+PrintGCDetails
    • -XX:+PrintGCDateStamps
    • -XX:+PrintTenuringDistribution
    • -XX:-UseGCOverheadLimit
    • -XX:+ExplicitGCInvokesConcurrent
    • -XX:+UseConcMarkSweepGC
    • -XX:+CMSScavengeBeforeRemark
    • -XX:+CMSParallelRemarkEnabled
  • Gracefully handle any arguments passed to the JVM.
  • Remain independent of any special terminal/display requirements (must be able to run inside and outside Netflix’s base AMI without modification)
  • Be able to run standalone to leverage special display capabilities if they are present
  • Be accessible from Asgard
  • Be dead-simple to use, requiring no specialized gc knowledge
  • Correlate netflix-internal events with gc activity. One example of a netflix-internal event is a cache refresh.
  • Retain its reports over time to enable long-term comparative analysis of a given application.

A Small Example: GC Event Chart

To understand how the GC event charts are constructed, it may be helpful to consider an example. In the picture below, gcviz visualizes a gc.log that contains three gc events:


  • a DefNew GC event at 60 seconds after JVM boot,
  • a DefNew GC event at 120 seconds after JVM boot, and
  • a DefNew GC event at 180 seconds after JVM boot


These three events took 100 milliseconds, 200 milliseconds, and 100 milliseconds, respectively. These three events would produce a chart with three DefNew “dots” on it. The “x” value of the dot would be the seconds since JVM boot, converted to absolute time, and the “y” value of the dot would be the duration in seconds. The red color of the dot indicates that DefNew is a “stop-the-world” GC event. Hopefully this is more clear in pictorial form:
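
A minimal sketch of that mapping (illustrative only, not gcviz's code): given the JVM boot time and a parsed event, the chart point is simply (boot time plus seconds-since-boot, pause duration), colored by whether the collector stops the world.

import java.time.Duration;
import java.time.Instant;

// Illustrative mapping from a parsed GC event to an x/y chart point.
public class GcChartPoint {
    final Instant x;        // absolute wall-clock time of the event
    final double ySeconds;  // pause duration in seconds
    final boolean stopTheWorld;

    GcChartPoint(Instant jvmBootTime, double secondsSinceBoot,
                 double pauseSeconds, boolean stopTheWorld) {
        this.x = jvmBootTime.plus(Duration.ofMillis((long) (secondsSinceBoot * 1000)));
        this.ySeconds = pauseSeconds;
        this.stopTheWorld = stopTheWorld;   // e.g. DefNew/ParNew dots are drawn in red
    }

    public static void main(String[] args) {
        Instant boot = Instant.parse("2013-01-01T00:00:00Z");       // hypothetical boot time
        GcChartPoint p = new GcChartPoint(boot, 60.0, 0.100, true); // DefNew at +60s, 100 ms
        System.out.println(p.x + " -> " + p.ySeconds + "s (STW=" + p.stopTheWorld + ")");
    }
}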



Real-world GC Event Charts

In mid-2012, Netflix used gcviz to analyze some performance problems a class of applications was having. These performance problems had 80-120 second garbage collection pauses as one of their symptoms. The gcviz event chart looked like this:


After the problem was identified and fixed the event chart looked like this:


The pauses had been reduced from 120 seconds to under 5 seconds. In addition, any Netflix engineer could quickly see that the symptom had been eliminated without needing to dig through gc.log.

Real-world Heap Size Charts

In another mid-2012 event, the heap usage of a Netflix application went higher than the designers of that application intended. The application designers intended memory usage to peak around 15GB. The chart below shows that heap usage peaked at 2.0 times 10^7 kilobytes (19.07 gigabytes) and it also shows “flumes” for periods of time where the heap usage went above its previous/standard high-water mark:


After that problem was identified and fixed, the high-water mark returned to 1.6 times 10^7 kilobytes (15.23 gigabytes), the flumes subsided, and the heap usage pattern became far more regular. In addition, the low-water mark dropped from 11.44 gigabytes to 9.54 gigabytes:

JVM collector compatibility

Currently, gcviz supports all of the HotSpot/JDK 7 collectors, with the exception of G1.

Additionally collected data

In addition to collecting gc.log data, gcviz collects additional system information (cpu usage, network usage, underlying virtual memory usage, etc.) to help correlate gc events with application events. In addition gcviz can be configured to capture a jmap histogram of live objects by object count, bytes required and class name.

Open-sourcing details

gcviz has been open sourced under the Apache License, version 2.0 at https://github.com/Netflix/gcviz

Conclusion
As a company, Netflix considers data visualization of paramount importance. Most of Netflix’s major systems contain significant visualization components (for example, the Hystrix Dashboard and Pytheas). Most visualization at Netflix occurs on continuous data, but visualizing discrete data has its place too. Being able to quickly differentiate between a GC-indicated allocation problem and other types of error conditions has been valuable in operating the many services required to bring streaming TV and movies into living rooms all across the world.

Notes
[1] I didn’t want to lose you by quoting gc.log so early in the blog post! gc.log is a file created by the JVM option (for example) -Xloggc:/apps/tomcat/logs/gc.log. Specifying this option is recommended and common inside Netflix. Its contents are something like the following:


2013-01-01T18:30:16.651+0000: 2652735.877: [CMS-concurrent-sweep-start]
2013-01-01T18:30:21.777+0000: 2652741.003: [CMS-concurrent-sweep: 5.126/5.126 secs] [Times: user=5.13 sys=0.00, real=5.12 secs]
2013-01-01T18:30:21.777+0000: 2652741.004: [CMS-concurrent-reset-start]
2013-01-01T18:30:21.842+0000: 2652741.068: [CMS-concurrent-reset: 0.065/0.065 secs] [Times: user=0.06 sys=0.00, real=0.07 secs]
2013-01-01T19:28:47.041+0000: 2656246.267: [GC 2656246.267: [ParNew
Desired survivor size 786432000 bytes, new threshold 15 (max 15)
- age   1:   26395600 bytes,   26395600 total
- age   2:       1376 bytes,   26396976 total
- age   3:       4184 bytes,   26401160 total
- age   4:    9591072 bytes,   35992232 total
- age   5:     747344 bytes,   36739576 total
- age   6:   18239512 bytes,   54979088 total
- age   7:    7398216 bytes,   62377304 total
- age   8:    4702664 bytes,   67079968 total
- age   9:       5584 bytes,   67085552 total
- age  10:       3728 bytes,   67089280 total
- age  11:       2416 bytes,   67091696 total
- age  12:   10838496 bytes,   77930192 total
- age  13:    1682368 bytes,   79612560 total
- age  14:   17756736 bytes,   97369296 total
- age  15:    6775352 bytes,  104144648 total
: 6103985K->124729K(10752000K), 0.1872850 secs] 11541109K->5564874K(29184000K), 0.1874910 secs] [Times: user=0.72 sys=0.00, real=0.19 secs]
2013-01-01T19:28:47.229+0000: 2656246.456: [GC [1 CMS-initial-mark: 5440145K(18432000K)] 5583733K(29184000K), 0.1454260 secs] [Times: user=0.15 sys=0.00, real=0.15 secs]
2013-01-01T19:28:47.375+0000: 2656246.602: [CMS-concurrent-mark-start]
2013-01-01T19:29:02.574+0000: 2656261.800: [CMS-concurrent-mark: 15.195/15.199 secs] [Times: user=15.24 sys=0.03, real=15.19 secs]
2013-01-01T19:29:02.574+0000: 2656261.801: [CMS-concurrent-preclean-start]
2013-01-01T19:29:02.638+0000: 2656261.864: [CMS-concurrent-preclean: 0.061/0.064 secs] [Times: user=0.06 sys=0.00, real=0.07 secs]
2013-01-01T19:29:02.638+0000: 2656261.865: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2013-01-01T19:29:08.589+0000: 2656267.816: [CMS-concurrent-abortable-preclean: 5.946/5.951 secs] [Times: user=5.96 sys=0.01, real=5.95 secs]

Conformity Monkey - Keeping your cloud instances following best practices


By Michael Fu and Cory Bennett, Engineering Tools



Cloud computing makes it much easier to launch new applications or start new instances. At Netflix, engineers can easily launch a new application in Asgard with a few clicks. With this freedom, however, launched applications or instances sometimes fail to follow best practices. This can happen when an engineer isn't familiar with the best practices or when those practices have not been well publicized. For example, required security groups may be missing from instances, causing security gaps. Or perhaps a health check URL is not defined for instances in Eureka, which disables automatic failure detection and failover.

Introducing the Conformity Monkey
At Netflix, we use Conformity Monkey, another member of the Simian Army, to check all instances in our cloud for conformity. Today, we are proud to announce that the source code for Conformity Monkey is now open and available to the public.

What is Conformity Monkey?

Conformity Monkey is a service which runs in the Amazon Web Services (AWS) cloud looking for instances that do not conform to predefined best-practice rules. Similar to Chaos Monkey and Janitor Monkey, the design of Conformity Monkey is flexible enough to allow extending it to work with other cloud providers and conformity rules. By default, the conformity check is performed every hour. The schedule can easily be re-configured to fit your business's needs.

Conformity Monkey determines whether an instance is nonconforming by applying a set of rules to it. If any of the rules determines that the instance is not conforming, the monkey sends an email notification to the owner of the instance. The open-sourced version provides a collection of conformity rules that are currently used at Netflix and that we believe are general enough to be used by most users. The design of Conformity Monkey also makes it simple to customize rules or to add new ones.
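
Conceptually, a rule just inspects an instance (or group) and reports whether it conforms. The sketch below uses hypothetical types for illustration; it is not the actual Conformity Monkey rule API, whose real extension points are documented in the project.

import java.util.Set;

// Hypothetical shapes, for illustration only; the real Conformity Monkey rule API differs.
interface Rule {
    String name();
    boolean isConforming(Instance instance);
}

class Instance {
    final String id;
    final Set<String> securityGroups;
    Instance(String id, Set<String> securityGroups) {
        this.id = id;
        this.securityGroups = securityGroups;
    }
}

// Flags instances that are missing any of the security groups required by policy.
class RequiredSecurityGroupsRule implements Rule {
    private final Set<String> required;
    RequiredSecurityGroupsRule(Set<String> required) { this.required = required; }

    @Override public String name() { return "InstanceHasRequiredSecurityGroups"; }

    @Override public boolean isConforming(Instance instance) {
        return instance.securityGroups.containsAll(required);
    }
}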

There can be exceptions when you want to ignore warnings from a specific conformity rule for some applications. For example, a security group that opens a specific port may not be needed by instances of some applications. We therefore allow you to customize the set of conformity rules applied to a cluster of instances by excluding the ones that are not needed.

How Conformity Monkey Works

Conformity Monkey works in two stages: mark and notify. First, Conformity Monkey loops through all autoscaling groups in your cloud and applies the specified set of conformity rules to the instances in each group. If any conformity rule determines that an instance is not conforming, the autoscaling group is marked as nonconforming and the instances that break the rule are recorded. Every autoscaling group is associated with an owner email, which can be obtained from an internal system or set in a configuration file. The simplest way is to use a default email address, e.g. your team's email list, for all the autoscaling groups. Conformity Monkey then sends an email notification about the nonconforming groups to the owner, with the details of the broken conformity rule and the instances that failed the conformity check. The application owners can then take the necessary actions to fix the failed instances, or exclude the conformity rule if they believe the check is not necessary for the application.

We allow you to set different frequencies for the conformity check and the notification. For example, at Netflix, the conformity check is performed every hour, and notifications are sent only once per day, at noon. This reduces the number of emails people receive about the same conformity warning. The real-time result of the conformity check for every autoscaling group is shown in a separate UI.
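
The mark stage can be pictured roughly as the loop below (reusing the hypothetical Rule and Instance types from the sketch above; the real implementation also records results, respects per-rule exclusions, and drives the separate notify stage).

import java.util.ArrayList;
import java.util.List;

// Rough sketch of the mark stage: apply every enabled rule to every instance in a group
// and collect the violations so the notify stage can email the group's owner later.
class ConformityChecker {
    List<String> markGroup(List<Instance> groupInstances, List<Rule> rules) {
        List<String> violations = new ArrayList<String>();
        for (Instance instance : groupInstances) {
            for (Rule rule : rules) {
                if (!rule.isConforming(instance)) {
                    violations.add(instance.id + " breaks rule " + rule.name());
                }
            }
        }
        return violations; // non-empty => the autoscaling group is marked nonconforming
    }
}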

Configuration and Customization

The conformity rules for each autoscaling group, and the parameters used to configure each individual rule, are all configurable. You can easily customize Conformity Monkey with the most appropriate set of rules for your autoscaling groups by setting Conformity Monkey properties in a configuration file. You can also create your own rules, and we encourage you to contribute your conformity rules to the project so that all can benefit.

Storage and Costs

Conformity Monkey stores its data in an Amazon SimpleDB table by default. You can easily check the SimpleDB records to find out the last conformity check results for your autoscaling groups. At Netflix we have a UI for showing the conformity check results and we have plans to open source it in the future as well.

There could be associated costs with Amazon SimpleDB, but in most cases the activity of Conformity Monkey should be small enough to fall within Amazon's Free Usage Tier. Ultimately the costs associated with running Conformity Monkey are your responsibility. For your reference, the costs of Amazon SimpleDB can be found at http://aws.amazon.com/simpledb/pricing/


Summary

Conformity Monkey helps keep our cloud instances following best practices. We hope you find Conformity Monkey to be useful for your business as well. We'd appreciate any feedback on it. We're always looking for new members to join the team. If you are interested in working on great open source software, take a look at jobs.netflix.com for current openings!



Announcing Zuul: Edge Service in the Cloud

The Netflix streaming application is a complex array of intertwined systems that work together to seamlessly provide our customers a great experience. The Netflix API is the front door to that system, supporting over 1,000 different device types and handling over 50,000 requests per second during peak hours. We are continually evolving by adding new features every day.  Our user interface teams, meanwhile, continuously push changes to server-side client adapter scripts to support new features and A/B tests.  New AWS regions are deployed to, and catalogs are added for new countries, to support international expansion.  To handle all of these changes, as well as other challenges in supporting a complex and high-scale system, a robust edge service that enables rapid development, great flexibility, expansive insights, and resiliency is needed.

Today, we are pleased to introduce Zuul, our answer to these challenges and the latest addition to our open source suite of software. Although Zuul is an edge service originally designed to front the Netflix API, it is now being used in a variety of ways by a number of systems throughout Netflix.


Zuul in Netflix's Cloud Architecture


How Does Zuul Work?

At the center of Zuul is a series of filters that are capable of performing a range of actions during the routing of HTTP requests and responses.  The following are the key characteristics of a Zuul filter:
  • Type: most often defines the stage during the routing flow when the filter will be applied (although it can be any custom string)
  • Execution Order: applied within the Type, defines the order of execution across multiple filters
  • Criteria: the conditions required in order for the filter to be executed
  • Action: the action to be executed if the Criteria are met
Here is an example of a simple filter that delays requests from a malfunctioning device in order to distribute the load on our origin:


class DeviceDelayFilter extends ZuulFilter {

    def static Random rand = new Random()

    @Override
    String filterType() {
        return 'pre'
    }

    @Override
    int filterOrder() {
        return 5
    }

    @Override
    boolean shouldFilter() {
        // Only delay requests that identify themselves as the malfunctioning device type
        return RequestContext.getCurrentContext().getRequest().
            getParameter("deviceType")?.equals("BrokenDevice") ?: false
    }

    @Override
    Object run() {
        // Sleep for a random number of milliseconds between 0 and 20,000 (up to 20 seconds)
        sleep(rand.nextInt(20000))
    }
}

Zuul provides a framework to dynamically read, compile, and run these filters. Filters do not communicate with each other directly - instead they share state through a RequestContext which is unique to each request.

Filters are currently written in Groovy, although Zuul supports any JVM-based language. The source code for each filter is written to a specified set of directories on the Zuul server that are periodically polled for changes. Updated filters are read from disk, dynamically compiled into the running server, and are invoked by Zuul for each subsequent request.
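
A stripped-down sketch of that mechanism, in Java (illustrative only; Zuul's real loader also polls the directories, tracks modification times, handles compilation errors, and registers filters by type and order), might compile one Groovy source file and instantiate the resulting filter like this:

import groovy.lang.GroovyClassLoader;
import java.io.File;

// Illustration only: compile one Groovy filter source file and instantiate it.
// The path below is a placeholder for one of the polled filter directories.
public class FilterCompileSketch {
    public static void main(String[] args) throws Exception {
        GroovyClassLoader loader = new GroovyClassLoader();
        Class<?> filterClass =
                loader.parseClass(new File("/path/to/filters/pre/DeviceDelayFilter.groovy"));
        Object filter = filterClass.newInstance();   // assumed to extend ZuulFilter
        System.out.println("Compiled and instantiated " + filter.getClass().getName());
    }
}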

Zuul Core Architecture

There are several standard filter types that correspond to the typical lifecycle of a request:
  • PRE filters execute before routing to the origin. Examples include request authentication, choosing origin servers, and logging debug info.
  • ROUTING filters handle routing the request to an origin. This is where the origin HTTP request is built and sent using Apache HttpClient or Netflix Ribbon.
  • POST filters execute after the request has been routed to the origin.  Examples include adding standard HTTP headers to the response, gathering statistics and metrics, and streaming the response from the origin to the client.
  • ERROR filters execute when an error occurs during one of the other phases.
Request Lifecycle
Alongside the default filter flow, Zuul allows us to create custom filter types and execute them explicitly.  For example, Zuul has a STATIC type that generates a response within Zuul instead of forwarding the request to an origin.  


How We Use Zuul

There are many ways in which Zuul helps us run the Netflix API and the overall Netflix streaming application.  Here is a short list of some of the more common examples, and for some we will go into more detail below:
  • Authentication
  • Insights
  • Stress Testing
  • Canary Testing
  • Dynamic Routing
  • Load Shedding
  • Security
  • Static Response handling
  • Multi-Region Resiliency 


Insights

    Zuul gives us a lot of insight into our systems, in part by making use of other Netflix OSS components.  Hystrix is used to wrap calls to our origins, which allows us to shed and prioritize traffic when issues occur.  Ribbon is our client for all outbound requests from Zuul, which provides detailed information into network performance and errors, as well as handles software load balancing for even load distribution.  Turbine aggregates fine-grained metrics in real-time so that we can quickly observe and react to problems.  Archaius handles configuration and gives the ability to dynamically change properties. 

    Because Zuul can add, change, and compile filters at run-time, system behavior can be quickly altered. We add new routes, assign authorization access rules, and categorize routes all by adding or modifying filters. And when unexpected conditions arise, Zuul has the ability to quickly intercept requests so we can explore, workaround, or fix the problem. 

    The dynamic filtering capability of Zuul allows us to find and isolate problems that would normally be difficult to locate among our large volume of requests.  A filter can be written to route a specific customer or device to a separate API cluster for debugging.  This technique was used when a new page from the website needed tuning.  Performance problems, as well as unexplained errors, were observed. It was difficult to debug the issues because the problems were only happening for a small set of customers. By isolating the traffic to a single instance, patterns and discrepancies in the requests could be seen in real time. Zuul has what we call a “SurgicalDebugFilter”. This is a special “pre” filter that will route a request to an isolated cluster if the patternMatches() criteria are met.  Adding this filter to match the new page allowed us to quickly identify and analyze the problem.  Prior to using Zuul, Hadoop was being used to query through billions of logged requests to find the several thousand requests for the new page.  We were able to reduce the problem to a search through a relatively small log file on a few servers and observe behavior in real time.

    The following is an example of the SurgicalDebugFilter that is used to route matched requests to a debug cluster:
    class SharpDebugFilter extends SurgicalDebugFilter {

        private static final Set<String> DEVICE_IDS = ["XXX", "YYY", "ZZZ"]

        @Override
        boolean patternMatches() {
            // Route the request to the debug cluster when its deviceId is in the watch list
            final String id = HTTPRequestUtils.getInstance().
                getValueFromRequestElements("deviceId")
            return DEVICE_IDS.contains(id)
        }
    }
    In addition to dynamically re-routing requests that match specified criteria, we have an internal system, built on top of Zuul and Turbine, that allows us to display a real-time streaming log of all matching requests/responses across our entire cluster.  This internal system allows us to quickly find patterns of anomalous behavior, or simply observe that some segment of traffic is behaving as expected (by asking questions such as “How many PS3 API requests are coming from São Paulo?”).


    Stress Testing 

    Gauging the performance and capacity limits of our systems is important for us to predict our EC2 instance demands, tune our autoscaling policies, and keep track of general performance trends as new features are added.  An automated process that uses dynamic Archaius configurations within a Zuul filter steadily increases the traffic routed to a small cluster of origin servers. As the instances receive more traffic, their performance characteristics and capacity are measured. This informs us of how many EC2 instances will be needed to run at peak, whether our autoscaling policies need to be modified, and whether or not a particular build has the required performance characteristics to be pushed to production.
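
    A sketch of the core idea (the property name is made up, and the real filter also selects the target cluster, records metrics, and enforces safety limits): a dynamic Archaius property controls the fraction of requests steered at the instances under test, so operators can ramp traffic up without redeploying.

    import com.netflix.config.DynamicDoubleProperty;
    import com.netflix.config.DynamicPropertyFactory;
    import java.util.Random;

    // Illustration: a dynamic property ("zuul.stresstest.weight" is a hypothetical name)
    // decides what fraction of traffic is routed to the cluster under load test.
    // Changing the property at runtime ramps traffic up or down with no redeploy.
    public class StressRoutingSketch {
        private static final DynamicDoubleProperty WEIGHT = DynamicPropertyFactory
                .getInstance().getDoubleProperty("zuul.stresstest.weight", 0.0);
        private static final Random RANDOM = new Random();

        static boolean routeToStressCluster() {
            return RANDOM.nextDouble() < WEIGHT.get();
        }

        public static void main(String[] args) {
            System.out.println("send this request to the stress cluster? " + routeToStressCluster());
        }
    }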


    Multi-Region Resiliency

    Zuul is central to our multi-region ELB resiliency project called Isthmus. As part of Isthmus, Zuul is used to bridge requests from the west coast cloud region to the east coast to help us have multi-region redundancy in our ELBs for our critical domains. Stay tuned for a tech blog post about our Isthmus initiative. 


    Zuul OSS

    Today, we are open sourcing Zuul as a few different components:
    • zuul-core - A library containing a set of core features.
    • zuul-netflix - An extension library using many Netflix OSS components:
      • Servo for insights, metrics, monitoring
      • Hystrix for real time metrics with Turbine
      • Eureka for instance discovery
      • Ribbon for routing
      • Archaius for real-time configuration
      • Astyanax for filter persistence in Cassandra
    • zuul-filters - Filters to work with zuul-core and zuul-netflix libraries 
    • zuul-webapp-simple -  A simple example of a web application built on zuul-core including a few basic filters
    • zuul-netflix-webapp - A web application putting zuul-core, zuul-netflix, and zuul-filters together.
    Netflix OSS libraries in Zuul
    Putting everything together, we are also providing a web application built on zuul-core and zuul-netflix.  The application also provides many helpful filters for things such as:
    • Weighted load balancing to balance a percentage of load to a certain server or cluster for capacity testing
    • Request debugging
    • Routing filters for Apache HttpClient and Netflix Ribbon
    • Statistics collecting
    We hope that this project will be useful for your application and will demonstrate the strength of our open source projects when using Zuul as a glue across them, and encourage you to contribute to Zuul to make it even better. Also, if this type of technology is as exciting to you as it is to us, please see current openings on our team: jobs  

    Mikey Cohen - API Platform
    Matt Hawthorne - API Platform



    Women in Cloud Meetup at Netflix

    By Shobana Radhakrishnan

    We recently held a Meetup on our campus for Bay Area women in the cloud space, in collaboration with Cloud-NOW. Women from a number of companies and backgrounds related to the cloud attended the event and participated in talks and panel discussions on various technical topics related to the cloud. I kicked off the evening, introducing Yury Izrailevsky, VP Cloud Computing and Platform Engineering at Netflix. Yury talked about the story of how Netflix entered cloud computing with its streaming service and scaled it by 100 times in 4 years, and how women engineers were a significant part of that effort. I also shared how Netflix engineering scales along two dimensions: strong technology and tools leveraging open source extensively, as well as a nimble culture without bureaucracy or unnecessary process.



    Cloud-NOW's Rita Scroggin, as co-host, introduced this non-profit consortium of the leading women in cloud computing, focused on using technology for the overall professional development of women (cloud-now.org).

    Keynote speaker Annika Jimenez took the stage next. Annika is Global Head of Data Science Services at Pivotal, the brand-new Big Data spinoff of EMC and VMware, in which GE has invested as well. Annika shared with the audience the reasons why data science is changing the way data computing is done. She showed how internet giants like Netflix, Google, Yahoo! and Facebook are leading the way to big data in the cloud, and that so much more work remains to be done.

    Devika Chawla spoke next. She leads the Netflix engineering team that is moving all the customer messaging to the cloud - those messages we get inviting us to join, create accounts, watch suggested movies, provide commentary and rejoin if we happen to have lapsed. Devika's team must be able to do this across devices (phone, iPad, TV…), to individuals as well as the entire Netflix user base, and ever-faster. Building in the Cloud enables them to meet this challenge scalably and cost-effectively.

    This was followed by various breakout sessions covering topics such as Cloud Security (IBM’s Uma Subramaniam with Netflix’s Jason Chan as co-host), Testing in a multi-cloud environment (Dell’s Seema Jethani with Netflix’s Sangeeta Narayanan as co-host) and Metrics for the cloud (Jeremy Edberg from Netflix and globalization expert Jessica Roland). Fang Ji also gave a peek at internal Netflix tools for monitoring cloud cost and performance, including AWSUsage, which will be OSS soon. A panel discussion on engineering leadership with Verticloud leaders Ellen Salisbury and Anna Sidana, as well as insight from Netflix VP of Talent Acquisition Jessica Neal into how the Freedom and Responsibility culture works for engineering, rounded out the discussion sessions.


    Afterward, participants were treated to some incredible sushi and a tour of some of our most exciting product demos, including some of the popular Netflix open source contributions. This included:
    • Asgard - web interface for application deployments and management in Amazon Web Services (AWS)
    • Garbage Collection Visualization (or GCViz) - Tool that turns the semi-structured data from the java garbage collector's gc.log into time-series charts for easy visual analysis. On Github here.
    • Genie - suite of tools, deployable on top of the Hadoop ecosystem, that enables even non-technical users to develop, tune, and maintain efficient Hadoop workflows and easily interact with and visualize datasets.
    • Hystrix and Turbine - used to understand API traffic patterns and system behavior. More at this blog post.

    With about two petabytes of data in the cloud, serving more than 36 million subscribers across more than 50 countries and territories, Netflix is always evolving new tools to manage our cloud systems, and encouraging innovation with prizes like our Cloud Prize. The deadline is September 15!
    We are also always seeking the best engineering talent to help - check out jobs.netflix.com if interested in solving these challenges.

    Isthmus - Resiliency against ELB outages

    On Christmas Eve, 2012, the Netflix streaming service experienced an outage.  For full details, see “A Closer Look at the Christmas Eve Outage” by Adrian Cockcroft.  This outage was particularly painful, both because of the timing and because the root cause, the ELB control plane, was outside of our ability to correct.  While our applications were running healthy, no traffic was getting to them.  AWS teams worked diligently to correct the problem, though it took several hours to completely resolve the outage.


    Following the outage, our teams had many discussions focusing on lessons and takeaways.  We wanted to understand how to strengthen our architecture so we could withstand issues like a region-wide ELB outage without service quality degradation for Netflix users.  If we wanted to survive such an outage, we needed a set of ELBs hosted in another region that we could use to route traffic to our backend services.  That was the starting point.


    Isthmus
    At the end of 2012 we were already experimenting with a setup internally referred to as “Isthmus” (definition here) for a different purpose: we wanted to see whether setting up a thin layer of ELBs plus a routing layer in a remote AWS region, with persistent long-distance connections between the routing layer and the backend services, would improve the latency of the user experience.  We realized we could use a similar setup to achieve multi-regional ELB resiliency.  Under normal operation, traffic would flow through both regions.  If one of the regions experienced ELB issues, we would route all the traffic through the other region via DNS.




    The routing layer that we used was developed by our API team.  It’s a powerful and flexible layer that can maintain a pool of connections, allows smart filtering, and much more.  You can find full details at our NetflixOSS GitHub site.  Zuul is at the core of the Isthmus setup: it forwards all user traffic and establishes the bridge (or isthmus) between the two AWS regions.


    We had to make some more changes to our internal infrastructure to support this effort.  Eureka, our service discovery solution, normally operates within a single AWS region.  In this particular setup, we needed Eureka in the US-West-2 region to be aware of Netflix services in US-East.  In addition, our middle-tier IPC layer, Ribbon, needed to understand whether to route requests to a service local to the region or to one in a remote location.


    We route user traffic to a particular set of ELBs via DNS.  These changes were typically made by one of our engineers through the DNS provider’s UI console, one endpoint at a time.  This method is manual and does not work well in case of an outage.  Thus, Denominator was born: an open source library to work with DNS providers and allow such changes to be made programmatically.  Now we could automate and repeatedly execute directional DNS changes.


    Putting it all together: changing user traffic in production
    In the weeks following the outage, we stood up the infrastructure necessary to support Isthmus and were ready to test it out.  After some internal tests, and stress tests simulating production-level traffic in our test environment, we deployed Isthmus in production, though it was taking no traffic yet.  Since the whole system was brand new, we proceeded very carefully.  We started with a single endpoint, though a rather important one: our API services.  Gradually, we increased the percentage of production traffic it was taking: 1%, 5% and so on, until we verified that we could actually route 100% of user traffic through an Isthmus without any detrimental effects on user experience.  Traffic routing was done with DNS geo-directional changes, specifying which states to route to which endpoint.


    After success with the API service working in Isthmus mode, we proceeded to repeat the same setup with other services that enable Netflix streaming.  Not taking any chances, we repeated the same gradual ramp-up and validation we did with API.  A similar sequence, though at faster ramp-up speeds, was followed for the remaining services that ensure a user’s ability to browse and stream movies.


    Over the last two months we’ve been shifting production user traffic between AWS regions to reach the desired stable state, where traffic flows approximately 50/50 between the US-East and US-West regions.


    The best way we can prove that this setup solves the problem we set out to resolve is by actually simulating an ELB outage and verifying that we can shift all the traffic to another AWS region.  We’re currently planning such a “Chaos” exercise and will be executing it shortly.


    First step towards the goal of Multi-Regional Resiliency
    The work that we’ve done so far improved our architecture to better handle region-wide ELB outages.  ELB is just one service dependency though, and we have many more.  Our goal is to be able to survive any region-wide issue, either a complete AWS Region failure or a self-inflicted problem, with minimal or no service quality degradation to Netflix users.  For example, the solution we’re working on should mitigate outages like the one we had on June 5th.  We’re starting to replicate all the data between the AWS regions, and eventually will stand up a full complement of services as well.  We’re working on such efforts now, and are looking for a few great engineers to join our Infrastructure teams.  If these are the types of challenges you enjoy, check out the Netflix Jobs site for more details. To learn more about Zuul, Eureka, Ribbon and other NetflixOSS components, join us for the upcoming NetflixOSS Meetup on July 17, 2013.

    Announcing Ice: Cloud Spend and Usage Analytics



    One of the advantages of moving to the cloud was increased engineering velocity.  Every engineer who needed cloud resources was able to procure them at the click of a button.  This led to an increase in resource usage and allowed us to move more quickly as an organization.  At the same time, seeing the big picture of how many resources were used and by whom became more difficult. In addition, Netflix is a highly decentralized environment where each service team decides how many resources their services need.  The elastic nature of the cloud makes capacity planning less crucial and teams can simply add resources as needed.  Viewing the broad picture of cloud resource usage becomes more difficult in such an environment.  To address both needs, Netflix created Ice.


    Ice provides a bird’s-eye view of our large and complex cloud landscape from a usage and cost perspective.  Cloud resources are dynamically provisioned by dozens of service teams within the organization and any static snapshot of resource allocation has limited value.  The ability to trend usage patterns on a global scale, yet decompose them down to a region, availability zone, or service team provides incredible flexibility. Ice allows us to quantify our AWS footprint and to make educated decisions regarding reservation purchases and reallocation of resources.


    We are thrilled to announce that today Ice joins the NetflixOSS platform.  You can get the source code on GitHub at https://github.com/Netflix/ice.


    Features
    Ice communicates with Amazon’s Programmatic Billing Access and maintains knowledge of the following key AWS entity categories:
    • Accounts
    • Regions
    • Services (e.g. EC2, S3, EBS)
    • Usage types (e.g. EC2 - m1.xlarge)
    • Cost and Usage Categories (On-Demand, Un-Used, Reserved, etc.)


    The UI allows you to filter directly on the above categories to custom tailor your view and slice and dice your billing data.


    In addition, Ice supports the definition of Application Groups. These groups are explicitly defined collections of resources in your organization. Such groups allow usage and cost information to be aggregated by individual service teams within your organization, each consisting of multiple services and resources. Ice also provides the ability to email weekly cost reports for each Application Group showing current usage and past trends.


    When representing the cost profile for individual resources, Ice will factor the depreciation schedule into your cost contour, if so desired.  The ability to amortize one-time purchases, such as reservations, over time allows teams to better evaluate their month-to-month cost footprint.
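
    As a simple, made-up illustration of the amortization idea: a hypothetical $1,752 one-year reservation plus a $0.10/hour effective rate appears not as a spike in the purchase month but as a flat amortized monthly cost.

    // Made-up numbers, purely to illustrate amortizing a one-time reservation purchase.
    public class AmortizationSketch {
        public static void main(String[] args) {
            double upfront = 1752.0;          // hypothetical 1-year reserved instance upfront cost
            double hourlyRate = 0.10;         // hypothetical effective hourly rate
            double hoursPerMonth = 730.0;     // average hours in a month
            double monthsInTerm = 12.0;

            double amortizedUpfrontPerMonth = upfront / monthsInTerm;
            double usagePerMonth = hourlyRate * hoursPerMonth;
            System.out.printf("amortized monthly cost: $%.2f%n",
                    amortizedUpfrontPerMonth + usagePerMonth);  // $146.00 + $73.00 = $219.00
        }
    }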


    Getting started
    After signing up for Programmatic Billing Access, follow the instructions at https://github.com/Netflix/ice#prerequisite to get started.


    Conclusion
    Ice has provided Netflix with insights into our AWS usage and spending.  It helps us identify inefficient usage and influences our reservation purchases.  It provides our entire product development organization with visibility into how many cloud resources they are using and enables each team to make engineering decisions to better manage usage.  We hope that by releasing it as part of NetflixOSS, the rest of the community can realize similar, and even greater, benefits.


    If you are interested in joining us on further building our distributed, scalable, and highly available NetflixOSS platform, please take a look at our jobs listing.

    Genie is out of the bottle!


    by Sriram Krishnan

    In a prior tech blog, we had discussed the architecture of our petabyte-scale data warehouse in the cloud. Salient features of our architecture include the use of Amazon’s Simple Storage Service (S3) as our "source of truth", leveraging the elasticity of the cloud to run multiple dynamically resizable Hadoop clusters to support various workloads, and our horizontally scalable Hadoop Platform as a Service called Genie.

    Today, we are pleased to announce that Genie is now open source, and available to the public from the Netflix OSS GitHub site.

    What is Genie?

    Genie provides job and resource management for the Hadoop ecosystem in the cloud. From the perspective of the end-user, Genie abstracts away the physical details of various (potentially transient) Hadoop resources in the cloud, and provides a REST-ful Execution Service to submit and monitor Hadoop, Hive and Pig jobs without having to install any Hadoop clients. And from the perspective of a Hadoop administrator, Genie provides a set of Configuration Services, which serve as a registry for clusters, and their associated Hive and Pig configurations.

    Why did we build Genie?

    There are two main reasons why we built Genie. Firstly, we run multiple Hadoop clusters in the cloud to support different workloads at Netflix. Some of them are launched as needed, and are hence transient - for instance, we spin up “bonus” Hadoop clusters nightly to augment our resources for ETL (extract, transform, load) processing. Others are longer running (viz. our regular “SLA” and “ad-hoc” clusters) - but may still be re-spun from time to time, since we work under the operating assumption that cloud resources may go down at any time. Users need to discover the latest incarnations of these clusters by name, or by the type of workloads that they support. In the data center, this is generally not an issue since Hadoop clusters don’t come up or go down frequently, but this is much more common in the cloud.

    Secondly, end-users simply want to run their Hadoop, Hive or Pig jobs - very few of them are actually interested in launching their own clusters, or even installing all the client-side software and downloading all the configurations needed to run such jobs. This is generally true in both the data center and the cloud. A REST-ful API to run jobs opens up a wealth of opportunities, which we have exploited by building web UIs, workflow templates, and visualization tools that encapsulate all our common patterns of use.

    What Genie Isn’t

    Genie is not a workflow scheduler, such as Oozie. Genie’s unit of execution is a single Hadoop, Hive or Pig job. Genie doesn’t schedule or run workflows - in fact, we use an enterprise scheduler (UC4) at Netflix to run our ETL.

    Genie is not a task scheduler, such as the Hadoop fair share or capacity schedulers either. We think of Genie as a resource match-maker, since it matches a job to an appropriate cluster based on the job parameters and cluster properties. If there are multiple clusters that are candidates to run a job, Genie will currently choose a cluster at random. It is possible to plug in a custom load balancer to choose a cluster more optimally - however, such a load balancer is currently not available.

    Finally, Genie is not an end-to-end resource management tool - it doesn’t provision or launch clusters, and neither does it scale clusters up and down based on their utilization. However, Genie is a key complementary tool, serving as a repository of clusters, and an API for job management.

    How Genie Works

    The following diagram explains the core components of Genie, and its two classes of Hadoop users - administrators, and end-users.

    Genie itself is built on top of the following Netflix OSS components:

    • Karyon, which provides bootstrapping, runtime insights, diagnostics, and various cloud-ready hooks,
    • Eureka, which provides service registration and discovery,
    • Archaius, for dynamic property management in the cloud,
    • Ribbon, which provides Eureka integration, and client-side load-balancing for REST-ful interprocess communication, and
    • Servo, which enables exporting metrics, registering them with JMX (Java Management Extensions), and publishing them to external monitoring systems such as Amazon's CloudWatch.

    Genie can be cloned from GitHub, built, and deployed into a container such as Tomcat. But it is not of much use unless someone (viz. an administrator) registers a Hadoop cluster with it. Registration of a cluster with Genie is as follows:

    • Hadoop administrators first spin up a Hadoop cluster, e.g. using the EMR client API.
    • They then upload the Hadoop and Hive configurations for this cluster (*-site.xml’s) to some location on S3.
    • Next, the administrators use the Genie client to discover a Genie instance via Eureka, and make a REST-ful call to register a cluster configuration using a unique id, and a cluster name, along with a few other properties - e.g. that it supports “SLA” jobs, and the “prod” metastore. If they are creating a new metastore configuration, then they may also have to register a new Hive or Pig configuration with Genie.

    After a cluster has been registered, Genie is now ready to grant any wish to its end-users - as long as it is to submit Hadoop jobs, Hive jobs, or Pig jobs!

    End-users use the Genie client to launch and monitor Hadoop jobs. The client internally uses Eureka to discover a live Genie instance, and Ribbon to perform client-side load balancing, and to communicate REST-fully with the service. Users specify job parameters, which consist of:

    • A job type, viz. Hadoop, Hive or Pig,
    • Command-line arguments for the job,
    • A set of file dependencies on S3 that can include scripts or UDFs (user defined functions).

    Users must also tell Genie what kind of Hadoop cluster to pick. For this, they have a few choices - they can use a cluster name or a cluster ID to pin to a specific cluster, or they can use a schedule (e.g. SLA) and a metastore configuration (e.g. prod), which Genie will use to pick an appropriate cluster to run a job on.

    Genie creates a new working directory for each job, stages all the dependencies (including Hadoop, Hive and Pig configurations for the chosen cluster), and then forks off a Hadoop client process from that working directory. It then returns a Genie job ID, which can be used by the clients to query for status, and also to get an output URI, which is browsable during and after job execution (see below). Users can monitor the standard output and error of the Hadoop clients, and also look at Hive and Pig client logs, if anything went wrong.

    The Genie execution model is very simple - as mentioned earlier, Genie simply forks off a new process for each job from a new working directory. Other than simplicity, important benefits of this approach include isolation of jobs from each other and from Genie, and easy accessibility of standard output, error and job logs for our end-users (since they are browsable from the output URIs). We made a decision not to queue up jobs in Genie - if we had implemented a job queue, we would have had to implement a fair-share or capacity scheduler for Genie as well, which is already available at the Hadoop level. The downside of this approach is that a JVM is spawned for each job, which implies that Genie can only run a finite number of concurrent jobs on an instance, based on available memory.
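
    To make the execution model concrete, here is a rough sketch (not Genie's actual code; the working directory path and command line are hypothetical) of forking a client process from a per-job working directory and capturing its output for later browsing:

    import java.io.File;

    // Illustration of the per-job fork model: a fresh working directory, a client
    // process launched from it, and stdout/stderr captured so they can be browsed later.
    public class JobForkSketch {
        public static void main(String[] args) throws Exception {
            File workingDir = new File("/mnt/tomcat/genie-jobs/job-12345");  // hypothetical path
            workingDir.mkdirs();

            ProcessBuilder pb = new ProcessBuilder("hive", "-f", "query.hql") // hypothetical job
                    .directory(workingDir)
                    .redirectOutput(new File(workingDir, "stdout.log"))
                    .redirectError(new File(workingDir, "stderr.log"));

            Process client = pb.start();
            int exitCode = client.waitFor();   // Genie instead tracks status asynchronously
            System.out.println("job finished with exit code " + exitCode);
        }
    }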

    Deployment at Netflix

    Genie scales horizontally using ASGs (Auto-Scaling Groups) in the cloud, which helps us run several hundreds of concurrent Hadoop jobs in production at Netflix, with the help of Asgard for cloud management and deployment. We use Asgard (see screenshot below) to pick minimum, desired and maximum instances (for horizontal scalability) in multiple availability zones (for fault tolerance). For Genie server pushes, Asgard provides the concept of a “sequential ASG”, which lets us route traffic to new instances of Genie once a new ASG is launched, and turn off traffic to old instances by marking the old ASG out of service.

    Using Asgard, we can also set up scaling policies to handle variable loads. The screenshot below shows a sample policy, which increases the number of Genie instances (by one) if the average number of running jobs per instance is greater than or equal to 25.

    Usage at Netflix

    Genie is being used in production at Netflix to run several thousands of Hadoop jobs daily, processing hundreds of terabytes of data. The screenshot below (from our internal Hadoop investigative tool, code named “Sherlock”) shows some of our clusters over a period of a few months.

    The blue line shows one of our SLA clusters, while the orange line shows our main ad-hoc cluster. The red line shows another ad-hoc cluster, with a new experimental version of a fair-share scheduler. Genie was used to route jobs to one of the two ad-hoc clusters at random, and we measured the impact of the new scheduler on the second ad-hoc cluster. When we were satisfied with the performance of the new scheduler, we spun up another larger consolidated ad-hoc cluster with the new scheduler (also shown by the orange line), and all new ad-hoc Genie jobs were now routed to this latest incarnation. The two older clusters were terminated once all running jobs were finished (we call this a “red-black” push).

    Summary

    Even though Genie is now open source, and has been running in production at Netflix for months, it is still a work in progress. We think of the initial release as version 0. The data model for the services is fairly generic, but definitely biased towards running at Netflix, and in the cloud. We hope for community feedback and contributions to broaden its applicability, and enhance its capabilities.

    We will be presenting Genie at the 2013 Hadoop Summit during our session titled “Genie - Hadoop Platform as a Service at Netflix”, and demoing Genie and other tools that are part of the Netflix Hadoop toolkit at the Netflix Booth. Please join us for the presentation, and/or feel free to stop by the booth, chat with the team, and provide feedback.

    If you are interested in working on great open source software in the areas of big data and cloud computing, please take a look at jobs.netflix.com for current openings!

    References

    Genie OSS
    Genie Wiki: Getting Started
    Netflix Open Source Projects
    @NetflixOSS Twitter Feed


    Introducing Lipstick on A(pache) Pig

    by Jeff Magnusson, Charles Smith, John Lee, and Nathan Bates

    We’re pleased to announce Lipstick (our Pig workflow visualization tool) as the latest addition to the suite of Netflix Open Source Software.

    At Netflix, Apache Pig is used heavily amongst developers when productionizing complex data transformations and workflows against our big data.  Pig provides good facilities for code reuse in the form of Python and Java UDFs and Pig macros. It also exposes a simple grammar that allows our users to easily express workflows on big datasets without getting “lost in the weeds” worrying about complicated MapReduce logic.

    While Pig’s high level of abstraction is one of its most attractive features, scripts can quickly reach a level of complexity at which the flow of execution, and its relation to the MapReduce jobs being executed, becomes difficult to conceptualize.  This tends to prolong and complicate the effort required to develop, maintain, debug, and monitor the execution of scripts in our environment. In order to address these concerns we have developed Lipstick, a tool that enables developers to visualize and monitor the execution of their data flows at a logical level.

    Lipstick was initially developed as a stand-alone tool that produced a graphical depiction of a Pig workflow.  While useful, we quickly realized that combining the workflow with information about the job as it ran gave the developer insight that previously required a lot of sifting through logs (or a Pig expert) to piece together.   Now, as an implementation of Pig’s Progress Notification Listener, Lipstick piggybacks on top of all Pig scripts executed in our environment, notifying a Lipstick server of job executions and periodically reporting progress as the script executes.
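
    Conceptually the hook looks like the hypothetical, simplified listener below; the real Pig PigProgressNotificationListener interface has more callbacks with different signatures, and Lipstick's implementation reports far richer plan and counter data to its server.

    // Hypothetical, simplified shape of a progress listener; not the actual Pig or
    // Lipstick API. The idea: Pig invokes callbacks as a script runs, and the
    // listener forwards script/job progress to a Lipstick-style server.
    interface ScriptProgressListener {
        void scriptStarted(String scriptId, int plannedMapReduceJobs);
        void progressUpdated(String scriptId, int percentComplete);
        void scriptFinished(String scriptId, int succeededJobs);
    }

    class LoggingProgressListener implements ScriptProgressListener {
        @Override public void scriptStarted(String scriptId, int plannedMapReduceJobs) {
            System.out.println(scriptId + " compiled into " + plannedMapReduceJobs + " MR jobs");
        }
        @Override public void progressUpdated(String scriptId, int percentComplete) {
            System.out.println(scriptId + " is " + percentComplete + "% complete"); // would POST to the server
        }
        @Override public void scriptFinished(String scriptId, int succeededJobs) {
            System.out.println(scriptId + " finished; " + succeededJobs + " jobs succeeded");
        }
    }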


    The screenshot above shows Lipstick in action.  In this example the developer would see:
    • This script compiled into 4 MapReduce jobs (two of which we can see represented by the blue bounding boxes)
    • Which logical operations execute in the mappers (blue header) vs the reducers (orange header)
    • Row counts from load / store / dump operations, as well as in between MapReduce jobs
    Had the script been currently executing, the boxes representing MapReduce jobs would have been flashing colors (blue or orange) to represent that they were currently executing in the map or reduce phase, and intermediate row counts would have been updating periodically as the Pig script heartbeat back to the Lipstick server.

    Lipstick has many cool features (check out the user guide to learn more), but there are two that we think are especially useful:
    • Clicking on intermediate row counts between MapReduce jobs displays a sample of intermediate results.
    • A toggle that switches between optimized and unoptimized versions of the logical plan.  This allows users to easily see how Pig is applying optimizations to the script (e.g. filters pushed into the loader).
    In the months we've been using Lipstick, it has already proven its worth many times over and we are just getting started.  If you would like to use Lipstick yourself or help us make it better, download it and give us your feedback.  If you like building tools that make it easier to work with big data (like Lipstick), check out our jobs page as well.

    HTML5 Video in IE 11 on Windows 8.1

    By Anthony Park and Mark Watson.

    We've previously discussed our plans to use HTML5 video with the proposed "Premium Video Extensions" in any browser which implements them.  These extensions are the future of premium video on the web, since they allow playback of premium video directly in the browser without the need to install plugins.

    Today, we're excited to announce that we've been working closely with Microsoft to implement these extensions in Internet Explorer 11 on Windows 8.1.  If you install the Windows 8.1 Preview from Microsoft, you can visit Netflix.com today in Internet Explorer 11 and watch your favorite movies and TV shows using HTML5!

    Microsoft implemented the Media Source Extensions (MSE) using the Media Foundation APIs within Windows.  Since Media Foundation supports hardware acceleration using the GPU, this means that we can achieve high quality 1080p video playback with minimal CPU and battery utilization.  Now a single charge gets you more of your favorite movies and TV shows!

    Microsoft also has an implementation of the Encrypted Media Extensions (EME) using Microsoft PlayReady DRM.  This provides the content protection needed for premium video services like Netflix.

    Finally, Microsoft implemented the Web Cryptography API (WebCrypto) in Internet Explorer, which allows us to encrypt and decrypt communication between our JavaScript application and the Netflix servers.

    We expect premium video on the web to continue to shift away from using proprietary plugin technologies to using these new Premium Video Extensions.  We are thrilled to work so closely with the Microsoft team on advancing the HTML5 platform, which gets a big boost today with Internet Explorer’s cutting edge support for premium video.  We look forward to these APIs being available on all browsers.

    NfWebCrypto: a Web Cryptography API Native Polyfill

    At Netflix we are excited to build an HTML5-based player for our service, as described in a previous blog post.  One of the “Premium Video Extensions” mentioned in that post is the Web Cryptography API, which “describes a JavaScript API for performing basic cryptographic operations in web applications, such as hashing, signature generation and verification, and encryption and decryption.” Netflix uses this API to secure the communication between our JavaScript and the Netflix servers.

    The Web Cryptography WG of the W3C (of which Netflix is a member) produces the Web Cryptography API specification. Currently the spec is in the Working Draft stage and some browser vendors are waiting until the spec is more finalized before proceeding with their implementations. A notable exception is Microsoft, who worked with us to implement a draft version of the spec in Internet Explorer 11 for Windows 8.1 Preview, which now allows plugin-free Netflix video streaming.

    To continue integrating our HTML5 application with other browsers, we decided to implement a polyfill based on the April 22, 2013 Editor’s Draft of the Web Cryptography API specification plus some other proposals under discussion. While similar in principle to JavaScript-based Web Crypto polyfills such as PolyCrypt, ours is implemented in native C++ (using OpenSSL 1.0.1c) to avoid the security risks of doing crypto in pure JavaScript. And because crypto functionality does not require deep browser integration, we were able to implement the polyfill as a stand-alone browser plugin, with our first implementation targeting Google’s Chrome browser using the Pepper Plugin API (PPAPI) framework.

    So that you can also experiment with cryptography on the web, and to support the ongoing development of the specification in the W3C, we’ve released this NfWebCrypto plugin implementation as open source under the Apache Version 2.0 license. While NfWebCrypto is not yet a complete implementation of the Web Cryptography API, and may differ from the most recent version of the rapidly changing spec, we believe it has the mainstream crypto features many web applications will require. This means that you can use this plugin to try a version of the Web Cryptography API now, before it comes to your favorite browser.

    At the moment the plugin is only supported in Chrome on Linux amd64 (tested in Ubuntu 12.04). For the latest details of what works and what does not, please see the README file in the NfWebCrypto GitHub repository. Here is a summary of the algorithms that are currently supported:

    • SHA1, SHA224, SHA256, SHA384, SHA512: digest
    • HMAC SHA: sign, verify, importKey, exportKey, generateKey
    • AES-128 CBC w/ PKCS#5 padding: encrypt, decrypt, importKey, exportKey, generateKey
    • RSASSA-PKCS1-v1_5: sign, verify, importKey, generateKey
    • RSAES-PKCS1-v1_5: encrypt, decrypt, importKey, exportKey, generateKey
    • Diffie-Hellman: generateKey, deriveKey
    • RSA-OAEP: wrapKey*, unwrapKey*
    • AES-KW: wrapKey*, unwrapKey*
    *Wrap/Unwrap operations follow the Netflix KeyWrap Proposal and support protection of the JWE payload with AES128-GCM.

    NfWebCrypto will of course be obsolete once browser vendors complete their implementations. In the meantime, this plugin is a stop-gap measure to allow people to move forward with cryptography on the web.  Since finalization of the spec may still be some time away, we hope the community will benefit from this early look. We also hope that a concrete implementation will provide a backdrop against which the evolving spec can be evaluated. Finally, the NfWebCrypto JavaScript unit tests and perhaps the actual C++ implementation may be useful references for browser vendors.

    Moving forward, we plan to keep pace with the W3C spec the best we can as it evolves. We welcome contributions to NfWebCrypto from the open source community, particularly in the areas of security audits, expanding the unit tests, and porting to other browser plugin frameworks and platforms.

    You can find NfWebCrypto at the Netflix Open Source Center on GitHub.

    NetflixOSS Meetup Series 1 Episode 3 - Featured Contributions

    by Adrian Cockcroft and Ruslan Meshenberg

    Our third NetflixOSS Meetup introduced our latest project releases, updates on the NetflixOSS Cloud Prize, and featured demonstrations from contributors as well as many Netflix projects.

    The projects we covered in the lightning presentations were:
    • Pytheas - A web based framework for quickly building dashboards
    • Conformity Monkey - Maintain best practices for cloud deployments
    • Zuul - Edge tier for dynamic filtering of requests
    • Ice - AWS usage and cost analysis tool
    • Genie - Hadoop platform abstraction service for EMR
    • Lipstick - Visualization of Pig workflows

    We had demonstrations of the above projects plus from NetflixOSS contributors:
    • Eucalyptus - V3.3 is now in production with support for NetflixOSS tools
    • IBM - Scalable implementation of Acme Air demo using NetflixOSS
    • Paypal - Rewrite of Asgard console to support Openstack deployments
    • Riot Games - Cloud Native architecture based on many NetflixOSS projects

    There was another good turnout, beer, wine, plenty of Greek food and lots of discussion around the demo stations. In the afternoon before the meetup we had a workshop/bootcamp with a small number of our most active NetflixOSS contributors. We were able to help them with their projects while also getting a lot of extremely useful feedback on many aspects of the NetflixOSS program.

    We are happy with the way that NetflixOSS is helping raise awareness of Cloud Native architecture, and how it has been adopted by larger organizations. It is in use at places like Riot Games that employ some ex-Netflix engineers, who continue to contribute code that Netflix uses even though we no longer need to pay them! However we are aware that individual Cloud Prize contestants and smaller organizations are suffering from "Technical Indigestion" because there is too much here for people to absorb and to get up and running quickly. To address this we have been concentrating our efforts on making it easier to get started. We were able to announce our first official Netflix AMI for Asgard at the event, and will be producing more of them in the coming weeks. We also have contributions to the NetflixOSS Cloud Prize which include Puppet based AMIs, a Chef Cookbook for Ice, and Ansible Playbooks.

    The NetflixOSS Cloud Prize has inspired some additional prizes: Citrix have said they will give $10K to the best contribution to getting NetflixOSS to work with Apache Cloudstack. In addition Canonical have created their own contest (based on a fork of the NetflixOSS Cloud Prize rules) for their Ubuntu Juju orchestration application, to create Juju Charms that install and manage applications based on combinations of individual services. There are several prizes of $10K available. Canonical want to encourage the creation of Juju Charms for installing NetflixOSS based applications, and Mark Shuttleworth of Canonical will be joining the NetflixOSS Cloud Prize judges, while Adrian Cockcroft will help judge the Juju Charm Championship.

    We have an outline plan to hold another NetflixOSS Meetup after the deadline for the NetflixOSS Cloud Prize on September 15th, where we will reveal the Nominations in each category. The final prize winners will be announced and receive their prizes at AWS Re:Invent, November in Las Vegas.

    We hope to see you there!

    Here are the slides:



    And the video:
    NetflixOSS S1 E3 Video



    Deploying the Netflix API

    by Ben Schmaus

    As described in previous posts (“Embracing the Differences” and “Optimizing the Netflix API”), the Netflix API serves as an integration hub that connects our device UIs to a distributed network of data services.  In supporting this ecosystem, the API needs to integrate an ever-evolving set of features from these services and expose them to devices.  The faster these features can be delivered through the API, the faster they can get in front of customers and improve the user experience.

    Along with the number of backend services and device types (we’re now topping 1,000 different device types), the rate of change in the system is increasing, resulting in the need for faster API development cycles. Furthermore, as Netflix has expanded into international markets, the infrastructure and team supporting the API has grown.  To meet product demands, scale the team’s development, and better manage our cloud infrastructure, we've had to adapt our process and tools for testing and deploying the API.

    With the context above in mind, this post presents our approach to software delivery and some of the techniques we've developed to help us get features into production faster while minimizing risk to quality of service.

    Moving Toward Continuous Delivery

    Before moving on it’s useful to draw a quick distinction between continuous deployment and delivery. Per the Continuous Delivery book, if you’re practicing continuous deployment then you’re necessarily also practicing continuous delivery, but the reverse doesn’t hold true.  Continuous deployment extends continuous delivery and results in every build that passes automated test gates being deployed to production.  Continuous delivery requires an automated deployment infrastructure but the decision to deploy is made based on business need rather than simply deploying every commit to prod.  We may pursue continuous deployment as an optimization to continuous delivery but our current focus is to enable the latter such that any release candidate can be deployed to prod quickly, safely, and in an automated way.

    To meet demand for new features and to make a growing infrastructure easier to manage, we’ve been overhauling our dev, build, test, and deploy pipeline with an eye toward continuous delivery.  Being able to deploy features as they’re developed gets them in front of Netflix subscribers as quickly as possible rather than having them “sit on the shelf.”  And deploying smaller sets of features more frequently reduces the number of changes per deployment, which is an inherent benefit of continuous delivery and helps mitigate risk by making it easier to identify and triage problems if things go south during a deployment.

    The foundational concepts underlying our delivery system are simple:  automation and insight.  By applying these ideas to our deployment pipeline we can strike an effective balance between velocity and stability.
    Automation - Any process requiring people to execute manual steps repetitively will get you into trouble on a long enough timeline.  Any manual step that can be done by a human can be automated by a computer; automation provides consistency and repeatability.  It’s easy for manual steps to creep into a process over time and so constant evaluation is required to make sure sufficient automation is in place.

    Insight - You can't support, understand, and improve what you can't see.  Insight applies both to the tools we use to develop and deploy the API as well as the monitoring systems we use to track the health of our running applications.  For example, being able to trace code as it flows from our SCM systems through various environments (test, stage, prod, etc.) and quality gates (unit tests, regression tests, canary, etc.) on its way to production helps us distribute deployment and ops responsibilities across the team in a scalable way.  Tools that surface feedback about the state of our pipeline and running apps give us the confidence to move fast and help us quickly identify and fix issues when things (inevitably) break.

    Development & Deployment Flow

    The following diagram illustrates the logical flow of code from feature inception to global deployment to production clusters across all of our AWS regions.  Each phase in the flow provides feedback about the “goodness” of the code, with each successive step providing more insight into and confidence about feature correctness and system stability.



    Taking a closer look at our continuous integration and deploy flow, we have the diagram below, which pretty closely outlines the flow we follow today.  Most of the pipeline is automated, and tooling gives us insight into code as it moves from one state to another.




    Branches

    Currently we maintain 3 long-lived branches (though we’re exploring approaches to cut down the number of branches, with a single master being a likely longer-term goal) that serve different purposes and get deployed to different environments.  The pipeline is fully automated with the exception of weekly pushes from the release branch, which require an engineer to kick off the global prod deployment.

    Test branch - used to develop features that may take several dev/deploy/test cycles and require integration testing and coordination of work across several teams for an extended period of time (e.g., more than a week).  The test branch gets auto deployed to a test environment, which varies in stability over time as new features undergo development and early stage integration testing.  When a developer has a feature that’s a candidate for prod they manually merge it to the release branch.

    Release branch - serves as the basis for weekly releases.  Commits to the release branch get auto-deployed to an integration environment in our test infrastructure and a staging environment in our prod infrastructure.  The release branch is generally in a deployable state but sometimes goes through a short cycle of instability for a few days at a time while features and libraries go through integration testing.  Prod deployments from the release branch are kicked off by someone on our delivery team and are fully automated after the initial action to start the deployment.

    Prod branch - when a global deployment of the release branch (see above) finishes it’s merged into the prod branch, which serves as the basis for patch/daily pushes.  If a developer has a feature that's ready for prod and they don't need it to go through the weekly flow then they can commit it directly to the prod branch, which is kept in a deployable state.  Commits to the prod branch are auto-merged back to release and are auto-deployed to a canary cluster taking a small portion of live traffic.  If the result of the canary analysis phase is a “go” then the code is auto deployed globally.

    Confidence in the Canary

    The basic idea of a canary is that you run new code on a small subset of your production infrastructure, for example, 1% of prod traffic, and you see how the new code (the canary) compares to the old code (the baseline).

    Canary analysis used to be a manual process for us where someone on the team would look at graphs and logs on our baseline and canary servers to see how closely the metrics (HTTP status codes, response times, exception counts, load avg, etc.) matched.

    Needless to say this approach doesn't scale when you're deploying several times a week to clusters in multiple AWS regions.  So we developed an automated process that compares 1000+ metrics between our baseline and canary code and generates a confidence score that gives us a sense for how likely the canary is to be successful in production.  The canary analysis process also includes an automated squeeze test for each canary Amazon Machine Image (AMI) that determines the throughput “sweet spot” for that AMI in requests per second.  The throughput number, along with server start time (instance launch to taking traffic), is used to configure auto scaling policies.

    The canary analyzer generates a report for each AMI that includes the score and displays the total metric space in a scannable grid.  For commits to the prod branch (described above), canaries that get a high-enough confidence score after 8 hours are automatically deployed globally across all AWS regions.

    The screenshots below show excerpts from a canary report.  
    If the score is too low (< 95 generally means a "no go", as is the case with the canary below), the report helps guide troubleshooting efforts by providing a starting point for deeper investigation. This is where the metrics grid, shown below, helps out. The grid puts more important metrics in the upper left and less important metrics in the lower right.  Green means the metric correlated between baseline and canary.  Blue means the canary has a lower value for a metric ("cold") and red means the canary has a higher value than the baseline for a metric ("hot").




    Along with the canary analysis report we automatically generate a source diff report, cross-linked with Jira (if commit messages contain Jira IDs), of code changes in the AMI, and a report showing library and config changes between the baseline and canary.  These artifacts increase our visibility into what’s changing between deployments.
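
    To make the scoring idea concrete, here is a toy sketch of the kind of comparison the report is built from: compare each metric's canary value to its baseline value, classify the metric as hot, cold, or ok, and roll the results up into a score out of 100. This is not the algorithm we actually run (which spans 1000+ metrics and uses more careful statistics); it only illustrates the concept, and the metric names and tolerance below are made up.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Toy canary scorer; the real analysis is far more sophisticated. */
    class CanaryScorer {
        enum Status { OK, HOT, COLD }

        /** A canary/baseline ratio outside [1 - tolerance, 1 + tolerance] flags the metric. */
        static Status classify(double baselineMean, double canaryMean, double tolerance) {
            if (baselineMean == 0.0) {
                return canaryMean == 0.0 ? Status.OK : Status.HOT;
            }
            double ratio = canaryMean / baselineMean;
            if (ratio > 1.0 + tolerance) return Status.HOT;   // canary running hotter than baseline
            if (ratio < 1.0 - tolerance) return Status.COLD;  // canary running colder than baseline
            return Status.OK;
        }

        /** Score = percentage of metrics whose canary value tracks the baseline. */
        static double score(Map<String, double[]> metrics, double tolerance) {
            int ok = 0;
            for (double[] pair : metrics.values()) {
                if (classify(pair[0], pair[1], tolerance) == Status.OK) ok++;
            }
            return metrics.isEmpty() ? 0.0 : 100.0 * ok / metrics.size();
        }

        public static void main(String[] args) {
            Map<String, double[]> metrics = new LinkedHashMap<>();   // name -> {baseline, canary}
            metrics.put("http.5xx.rate",    new double[]{0.2, 0.9});    // hot
            metrics.put("latency.p99.ms",   new double[]{180, 185});    // ok
            metrics.put("loadavg",          new double[]{3.1, 3.0});    // ok
            metrics.put("requests.per.sec", new double[]{950, 960});    // ok
            double result = score(metrics, 0.10);
            System.out.printf("canary score: %.1f -> %s%n", result, result >= 95 ? "go" : "no go");
        }
    }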

    Multi-region Deployment Automation

    Over the past two years we've expanded our deployment footprint from 1 to 3 AWS regions supporting different markets around the world.  Running clusters in geographically disparate regions has driven the need for more comprehensive automation.  We use Asgard to deploy the API (with some traffic routing help from Zuul), and we’ve switched from manual deploys using the Asgard GUI to driving deployments programmatically via Asgard’s API.

    The basic technique we use to deploy new code into production is the "red/black push."  Here's a summary of how it works.

    1) Go to the cluster - which is a set of auto-scaling groups (ASGs) - running the application you want to update, like the API, for example.

    2) Find the AMI you want to deploy, look at the number of instances running in the baseline ASG, and launch a new ASG running the selected AMI with enough instances to handle traffic levels at that time (for the new ASG we typically use the number of instances in the baseline ASG plus 10%).  When the instances in the new ASG are up and taking traffic, the new and baseline code is running side by side (ie, "red/red").

    3) Disable traffic to the baseline ASG (ie, make it "black"), but keep the instances online in case a rollback is needed.  At this point you'll have your cluster in a "red/black" state with the baseline code being "black" and the new code being "red." If a rollback is needed, since the "black" ASG still has all its instances online (just not taking traffic) you can easily enable traffic to it and then disable the new ASG to quickly get back to your starting state.  Of course, depending on when the rollback happens you may need to adjust server capacity of the baseline ASG.

    4) If the new code looks good, delete the baseline ASG and its instances altogether.  The new AMI is now your baseline.

    The following picture illustrates the basic flow.



    Going through the steps above manually for many clusters and regions is painful and error prone.  To make deploying new code into production easier and more reliable, we've built additional automation on top of Asgard to push code to all of our regions in a standardized, repeatable way.  Our deployment automation tooling is coded to be aware of peak traffic times in different markets and to execute deployments outside of peak times.  Rollbacks, if needed, are also automated.
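
    Putting the steps above together, the deployment automation boils down to something like the following sketch. The CloudDriver interface and every method on it are hypothetical placeholders for illustration; in practice we drive these steps through Asgard's API rather than a hand-rolled client like this.

    import java.time.Duration;

    /** Sketch of the red/black sequence; CloudDriver is a made-up stand-in, not Asgard's API. */
    class RedBlackPush {
        private final CloudDriver cloud;

        RedBlackPush(CloudDriver cloud) { this.cloud = cloud; }

        void deploy(String cluster, String newAmi, double headroom) {
            // 1) Find the baseline ASG currently taking traffic for this cluster.
            String baselineAsg = cloud.activeAsg(cluster);
            int baselineSize = cloud.instanceCount(baselineAsg);

            // 2) Launch a new ASG with the new AMI, sized for current traffic plus headroom (~10%).
            int newSize = (int) Math.ceil(baselineSize * (1.0 + headroom));
            String newAsg = cloud.createAsg(cluster, newAmi, newSize);
            cloud.waitUntilInstancesInService(newAsg, Duration.ofMinutes(45));   // "red/red"

            // 3) Disable traffic to the baseline ASG but keep its instances for fast rollback.
            cloud.disableTraffic(baselineAsg);                                   // "red/black"
            if (!cloud.looksHealthy(cluster)) {
                cloud.enableTraffic(baselineAsg);   // rollback: old code takes traffic again
                cloud.disableTraffic(newAsg);
                return;
            }

            // 4) The new AMI is the baseline now; delete the old ASG and its instances.
            cloud.deleteAsg(baselineAsg);
        }

        /** Minimal surface the sketch needs; all of these methods are illustrative. */
        interface CloudDriver {
            String activeAsg(String cluster);
            int instanceCount(String asg);
            String createAsg(String cluster, String ami, int desiredInstances);
            void waitUntilInstancesInService(String asg, Duration timeout);
            void disableTraffic(String asg);
            void enableTraffic(String asg);
            boolean looksHealthy(String cluster);
            void deleteAsg(String asg);
        }
    }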

    Keep the Team Informed

    With all this deployment activity going on, it’s important to keep the team informed about what’s happening in production.   We want to make it easy for anyone on the team to know the state of the pipeline and what’s running in prod, but we also don’t want to spam people with messages that are filtered out of sight.  To complement our dashboard app, we run an XMPP bot that sends a message to our team chatroom when new code is pushed to a production cluster.  The bot sends a message when a deployment starts and when it finishes.  The topic of the chatroom has info about the most recent push event and the bot maintains a history of pushes that can be accessed by talking to the bot.
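
    As a rough illustration of the bot idea (not our actual implementation), a minimal deploy-notification bot can be sketched with the Smack XMPP library; the host, room, and credentials below are placeholders, and the real bot does more (push history, chatroom topic updates, and so on).

    import org.jivesoftware.smack.ConnectionConfiguration;
    import org.jivesoftware.smack.XMPPConnection;
    import org.jivesoftware.smack.XMPPException;
    import org.jivesoftware.smackx.muc.MultiUserChat;

    /** Minimal chatroom notifier sketched against Smack 3.x; all values are placeholders. */
    public class DeployBot {
        private final MultiUserChat room;

        public DeployBot(String host, String user, String password, String roomJid, String nick)
                throws XMPPException {
            XMPPConnection connection = new XMPPConnection(new ConnectionConfiguration(host, 5222));
            connection.connect();
            connection.login(user, password);
            room = new MultiUserChat(connection, roomJid);   // e.g. "api-team@conference.example.com"
            room.join(nick);
        }

        public void announce(String message) throws XMPPException {
            room.sendMessage(message);
        }

        public static void main(String[] args) throws Exception {
            DeployBot bot = new DeployBot("chat.example.com", "deploybot", "secret",
                    "api-team@conference.example.com", "deploybot");
            bot.announce("Deployment of api-prod v1234 started in us-east-1");
            bot.announce("Deployment of api-prod v1234 finished: all instances in service");
        }
    }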



    Move Fast, Fail Fast (and Small)

    Our goal is to provide the ability for any engineer on the team to easily get new functionality running in production while keeping the larger team informed about what’s happening, and without adversely affecting system stability.  By developing comprehensive deployment automation and exposing feedback about the pipeline and code flowing through it we’ve been able to deploy more easily and address problems earlier in the deploy cycle.  Failing on a few canary machines is far superior to having a systemic failure across an entire fleet of servers.  

    Even with the best tools, building software is hard work.  We're constantly looking at what's hard to do and experimenting with ways to make it easier.  Ultimately, great software is built by great engineering teams.  If you're interested in helping us build the Netflix API, take a look at some of our open roles.


    Glisten, a Groovy way to use Amazon's Simple Workflow Service

    by Clay McCoy

    While adding a new automated deployment feature to Asgard we realized that our current in-memory task system was not sufficient for the new demands. There would now be tasks that would be measured in hours or days rather than minutes and that work needed to be resilient to the failure of a single Asgard instance. We also wanted better asynchronous task coordination and the ability to distribute these tasks among a fleet of Asgard instances.

    Amazon's Simple Workflow Service

    Amazon's Simple Workflow Service (SWF) is a task based API for building highly scalable and resilient applications. With SWF the progress of your tasks is persisted by AWS while all the actual work is still done on your own servers. Your services poll for decision tasks and activity tasks. Decision tasks simply determine what to do next (start an activity, start a timer...) based on the workflow progress so far. This is high level logic that orchestrates your activities and should execute very quickly. Activity tasks are where real processing is performed (calculations, contacting remote services, I/O...). SWF was exactly what we were looking for in a distributed task system, but we quickly realized that it can be arduous writing a workflow against the base SWF API. It is up to you to do a lot of low level operations and your actual application logic can get lost in the mix.

    Amazon's Flow Framework

    Amazon anticipated our predicament and provided the Flow Framework which is a higher level API on top of SWF. It minimizes SWF based boilerplate code and makes your workflow look more like ordinary Java code. It also provides a lot of useful SWF infrastructure for registering SWF objects, polling for tasks, analyzing workflow history, and responding with decisions. Flow enforces a programming model where you implement your own interfaces for workflows and activities.


    The interfaces contain special Flow annotations that identify their roles and allow specification of versions, timeouts, and more.
    @Workflow
    @WorkflowRegistrationOptions(
            defaultExecutionStartToCloseTimeoutSeconds = 60L)
    interface TestWorkflow {
        @Execute(version = '1.0')
        void doIt()
    }

    @Activities(version = '1.0')
    @ActivityRegistrationOptions(
            defaultTaskScheduleToStartTimeoutSeconds = -1L,
            defaultTaskStartToCloseTimeoutSeconds = 300L)
    interface TestActivities {
        String doSomething()
        void consumeSomething(String thing)
    }
    Flow generates code to make your activities asynchronous. Promises will need to wrap your activity method return values and parameters. Rather than the TestActivities above, you will program against the generated TestActivitiesClient below.
    interface TestActivitiesClient {
        Promise<String> doSomething()
        void consumeSomething(Promise<String> thing)
    }
    The workflow implementation is your decider logic which gets replayed repeatedly until your workflow is complete. In your workflow implementation you can reference the generated activities client that was just described. Flow uses AspectJ and @Asynchronous annotations on methods to ensure that promises are ready before executing the method body that uses their results. In this example, 'doIt' is the entry point to the workflow due to the @Execute annotation on the interface above. First we 'doSomething' and wait on the result before we send it to 'consumeSomething'.
    class TestWorkflowImpl implements TestWorkflow {
        private final TestActivitiesClient client = new TestActivitiesClientImpl();

        void doIt() {
            Promise<String> result = client.doSomething()
            waitForSomething(result)
        }

        @Asynchronous
        void waitForSomething(Promise<String> something) {
            client.consumeSomething(something)
        }
    }
    Flow clearly offers a lot of help in easing the use of SWF. Unfortunately its dependence on AspectJ and code generation kept us from using it as is. Asgard is a Groovy and Grails application that already has enough byte code manipulation and runtime magic. Since Groovy itself is well suited to the job of hiding boilerplate code we began to wonder if we could use it to get what we wanted from SWF.

    Netflix OSS Glisten

    Glisten is an ease of use SWF library developed at Netflix. It still uses core Flow objects but does not require AspectJ or code generation. Glisten provides WorkflowOperations and ActivitiesOperations interfaces that can be used by your WorkflowImplementation and ActivitiesImplementation classes respectively. All of the SWF specifics are hidden behind these operation interfaces in specific SWF implementations. There are also local implementations that allow for easy unit testing of workflows and activities.
    Let's take a look at what a Glisten based workflow implementation looks like. Without code generation or AspectJ we no longer have the use of generated clients or the @Asynchronous annotation. Instead we use WorkflowOperations to provide 'activities' and 'waitFor' in addition to many other workflow concerns. Note that the Groovy annotation @Delegate is used here to allow the WorkflowOperations' public methods to appear on TestWorkflowImpl itself just to clean up the code. Like in the Flow example above, the 'doSomething' activity is scheduled and then we 'waitFor' its result to be ready. Once ready, the closure is executed where the 'consumeSomething' activity is provided with an 'it' parameter. In Groovy you can use 'it' to refer to an implicit parameter that is made available to the closure. Here 'it' is the result of the Promise passed into 'waitFor'. This is a pretty dense example of how we are using Groovy to handle some of the syntactic sugar that we lost from Flow by removing AspectJ and code generation.
    class TestWorkflowImpl implements TestWorkflow {
        @Delegate
        WorkflowOperations<TestActivities> workflowOperations = SwfWorkflowOperations.of(TestActivities)

        void doIt() {
            waitFor(activities.doSomething()) {
                activities.consumeSomething(it)
            }
        }
    }
    Glisten is a lightweight way to use SWF and only requires a dependency on Groovy. Most of your code can still be written in Java if you prefer. Glisten is currently used in Asgard to enable long lived deployment tasks. There is a comprehensive example workflow in the Glisten codebase and documented on the wiki. It demonstrates many SWF features (timers, parallel tasks, retries, error handling...) along with unit tests.

    Glisten makes it easier for us to use Amazon's SWF, and maybe it can help you too. If you are interested in helping develop projects like this feel free to contribute or even join us at Netflix.

    NetflixOSS Meetup S1E4 - Cloud Prize Nominations


    by Adrian Cockcroft

    We launched the Netflix Open Source Software Cloud Prize in March 2013 and it got a lot of attention in the press and blogosphere. Six months later we closed the contest, took a good look at the entrants, picked the best as nominees and announced them at a Netflix Meetup in Los Gatos on September 25th. The next step is for the panel of distinguished judges to decide who wins in each category, and the final winners will be announced at AWS Re:Invent in November.


    Starting the nominations with some “monkey business”, we were looking for additions to the Netflix Simian Army that automates operations and failure testing for NetflixOSS. We have three nominees who between them built sixteen new monkeys, and one portability extension.


    Peter Sankauskas (pas256, of San Mateo California) who built Backup Monkey and Graffiti Monkey concentrated on automating management of attached Elastic Block Store volumes. Backup Monkey makes sure you always have snapshot backups, and Graffiti monkey tags EBS volumes with information about the host they are attached to, so you can always figure out what is on each EBS volume or snapshot. By using monkeys to do this, we can be sure that every EBS volume is tagged the same way, and that we always have snapshot backups, even though creation of volumes can be done “self service” by any developer. Peter also contributed Ansible playbooks and many pre-built AMIs to make it easier for everyone else to get started with NetflixOSS, and put Asgard and Edda in the AWS Marketplace. He recently started his own company AnsWerS to help people who want to move to AWS.


    Justin Santa Barbara (justinsb, of San Francisco, California) decided to make the Chaos Monkey far more evil, and created fourteen new variants, a “barrel of chaos monkeys”. They interfere with the network, causing routing failure, packet loss, network data corruption and extra network latency. They block access to DNS, S3, DynamoDB and the EC2 control plane. They interfere with storage, by disconnecting EBS volumes, filling up the root disk, and saturating the disks with IO requests. They interfere with the CPU by consuming all the spare cycles, or killing off all the processes written in Python or Java. When run, a random selection is made, and the victims suffer the consequences. This is an excellent but scary workout for our monitoring and repair/replacement automation.


    Our original Chaos Monkey framework keeps a small amount of state and logs what it does in SimpleDB, which is an AWS-specific service. To make it easier to run Chaos Monkey in other clouds, or datacenter environments such as Eucalyptus or OpenStack, John Gardner (huxoll, of Austin, Texas) generalized the interface and provided a sample implementation that he calls “Monkey Recorder”, which writes to local disk.

    Keeping with the animal theme, we have some “piggy business”. Anyone familiar with the Hadoop tools and the big data ecosystem knows about the Pig language. It provides a way to specify a high level dataflow for processing but the Pig scripts can get complex and hard to debug. Netflix built and open sourced a visualization and monitoring tool called Lipstick, and it was adopted by a vendor called Mortardata (mortardata, of New York, NY) who worked with us to generalize some of the interfaces and integrate it with their own Pig based Hadoop platform. We saved Mortardata from having to create their own tool to do this, and Netflix now has an enthusiastic partner to help improve and extend Lipstick so everyone who uses it benefits.

    Business computing is what IBM is known for. They have put in a lot of work and produced two related entries. IBM had previously created a demonstration application called Acme Air for their Websphere tools running on IBM Smartcloud. It was a fairly conventional enterprise architecture application, with a Java front end and a database back end. For their first prize entry, Andrew Spyker (aspyker, of Raleigh North Carolina) figured out how to re-implement Acme Air as a cloud native example application using NetflixOSS libraries and component services, running on AWS. He then ran some benchmark stress tests to demonstrate scalability. This was demonstrated at a Netflix Meetup last summer. Following on, a team led by Richard Johnson (EmergingTechnologyInstitute, of Raleigh North Carolina) including Andrew Spyker and Jonathan Bond ported several components of NetflixOSS to the IBM Softlayer cloud using Rightscale to provide autoscaling functionality. This involved ports of the Eureka service registry, Hystrix circuit breaker pattern, Karyon base server framework, Ribbon http client and Asgard provisioning portal. They even made a video demo of the final product and put it up on YouTube. This was a lot of work, but IBM sees the value in getting a deep understanding of Cloud Native architecture and tools, which it can then figure out how to apply to helping enterprise customers make the transition to cloud.



    Acme Air is a fairly simple application with a web based user interface, but in the real world complex web service APIs are hard to manage, and NetflixOSS includes the Zuul API gateway, which is used to authenticate, process, and route HTTP requests. The next nomination is from Neil Beveridge (neilbeveridge, of Kent, United Kingdom). He was interested in porting the Zuul container from Tomcat to Netty, which also provides non-blocking output requests, and benchmarking the difference. He ran into an interesting problem with Netty consuming excess CPU and running slower than the original Tomcat version, and then ran into the contest deadline, but plans to continue work to debug and tune the Netty code. Since Netflix is also looking at moving some of our services from Tomcat to Netty, this is a useful and timely contribution. It’s also helpful to other people considering using Zuul to have some published benchmarks to show the throughput on a common AWS instance type.


    Eucalyptus have been using NetflixOSS to provide a proof point for portability of applications from AWS to private clouds based on Eucalyptus. In June 2013 they shipped a major update that included advanced AWS features such as Autoscale Groups that NetflixOSS depends on. To support the extra capabilities of Eucalyptus and the ability to deploy applications to AWS regions and Eucalyptus datacenters from the same Asgard console, Chris Grzegorczyk and Greg Dekoenigsberg (eucaflix, grze and gregdek, of Goleta California) made a series of changes to NetflixOSS projects which they submitted as a prize entry. They have demonstrated working code at several Netflix meetups.


    Turbine is a real time monitoring mechanism that provides a continuous stream of updates for tracking the Hystrix circuit breaker pattern that is used to protect API calls from broken dependencies. Improvements by Michael Rose (Xorlev, of Lakewood, Colorado) extended Turbine so that it can be used in environments that are using Zookeeper for service discovery rather than Eureka.


    Cheng Ma and Joe Gardner (xiaoma318 & joehack3r, of Houston, Texas) built three related user interface tools, MyEdda, MyJanitor and ASG Console, to simplify operations monitoring. Edda collects a complete history of everything deployed in an AWS account; Janitor Monkey uses Edda to find entities such as unused empty Autoscaling Groups that it can remove. Fei Teng (veyronfei, of Sydney, Australia) built Clouck, a user interface that keeps track of AWS resources across regions. These tools let you see what is going on more easily.


    EC2box by Sean Kavanagh (skavanagh, of Louisville, Kentucky) is a web based ssh console that can replicate command line operations across large numbers of instances, and also acts as a control and audit point so that operations by many engineers can be coordinated and managed centrally.


    When we started to build the Denominator library for portable DNS management we contacted Neustar to discuss their UltraDNS product, and made contact with Jeff Damick (jdamick, of South Riding, Virginia). His input as we structured the early versions of Denominator was extremely useful, and provides a great example of the power of developing code in public. We were able to tap into his years of experience with DNS management, and he was able to contribute code, tests and fixes to the Denominator code and fixes to the UltraDNS API itself.


    Jakub Narloch (jmnarloch, of Szczecin, Poland) started out by helping to configure the JBoss Arquillian test framework to do unit tests for the Denominator DNS integration with UltraDNS. This work was extended to include Karyon, the base server that underpins NetflixOSS services and acts as the starting point for developing new services. Since Genie is based on Karyon, we were able to leverage this integration to use Arquillian to test Genie, and the changes have been merged into the code that Netflix uses internally.


    Feign is a simple annotation based way to construct http clients for Java applications. It was developed as a side project of Denominator, and David Carr (davidmc24, of Clifton Park, New York) has emerged as a major contributor to Feign. As well as a series of pull requests that simplified common Feign use cases, he’s acted as a design reviewer for changes proposed by Netflix engineering, and we just proposed adding him as a committer on the project. This is another example of the value of developing code “in public”.


    Netflix uses our Servo instrumentation library to annotate and collect information and post it into AWS Cloudwatch, but many people use a somewhat similar library developed by Coda Hale at Yammer, which is called Metrics. Maheedhar Gunturu (mailmahee, of Santa Clara, California) instrumented the Astyanax Cassandra client library to generate Yammer Metrics. Astyanax is one of the most widely used NetflixOSS projects, and it’s used in many non-AWS contexts so this is a useful generalization.


    Priam is our Java Tomcat service that runs on each node in a Cassandra cluster to manage creation and backup of Cassandra. Sean McCully (seanmccully, of San Antonio, Texas) ported the functionality of Priam from Java to Python and called it Hector. This is useful in environments that don’t use Java or that run smaller Cassandra clusters on small instances where the reduced memory overhead of a Python implementation leaves more space for Cassandra itself.


    Although the primary storage used by Netflix is based on Cassandra, we also use AWS RDS to create several small MySQL databases for specific purposes. Other AWS customers use RDS much more heavily. Jiaqui Guo (jiaqui, Chicago, Illinois) has built Datamung to automate backup of RDS to S3 and replication of backups across regions for disaster recovery.


    Abdelmonaim Remani (PolymathicCoder, Capitola, California) has built a DynamoDB Framework that is similar to the way we use Astyanax for Cassandra. As well as providing a nice annotation based Java interface the framework adds extra functionality such as cross regional replication by managing access to multiple DynamoDB back ends.


    The Reactive Extensions (Rx) pattern is one of the most advanced and powerful concepts for structuring code to come out in recent years. The original work on Rx at Microsoft by Erik Meijer inspired Netflix to create the RxJava project. We started with a subset of Rx functionality and left a lot of “to do” areas. This inspired Mairbek Khadikov (mairbek, Kharkiv, Ukraine) to help us fill in the missing features with over thirty pull requests for the RxJava project. As the project matured we began to extend RxJava to include other JVM based languages and Joachim Hofer (jmhofer, Möhrendorf, Germany) made major contributions to type safety and Scala support, again with over thirty pull requests.


    When Paypal decided they wanted a developer oriented console for their OpenStack based private cloud they took a look at Asgard and realized that it was close enough to what they wanted, so they could use it as a starting point. Anand Palanisamy (paypal, San Jose, California) submitted Aurora, their fork of Asgard, as a prize entry, and demonstrated it running at one of the Netflix meetups.


    Riot Games have adopted many NetflixOSS components and extended them to meet their own needs. They demonstrated their work at the summer NetflixOSS Meetup, and Asbjorn Kjaer (bunjiboys, Santa Monica, California) submitted Chef-Solo for AMInator, which integrates Chef recipes for building systems with the immutable AMI based instance model that Netflix uses for deployments.


    We are pleased to have such a wide variety of nominations, from individuals around the world, small and large companies, vendors and end users. Many thanks to all of them for the work they have put into helping grow the NetflixOSS ecosystem, and thanks to everyone else who just uses NetflixOSS or entered the contest but didn’t make the cut. Next the judges will pick ten winners and we will be contacting them and secretly flying them to Las Vegas in November. There they will be announced, meet with the judges and the Netflix team and pick up their $10K prize money, $5K AWS credits and Cloud Monkey trophy.


    Introducing Chaos to C*



    by: Christos Kalantzis, Minh Do, Homajeet Cheema, Roman Vasilyev

    One of the practices that sets Netflix apart from most companies is the belief that you can only know how good your software stack is by trying to make it fail. We've blogged about Chaos Monkey and how it helps identify deficiencies in your software stack. Netflix has implemented Chaos Monkey on our mid-tier stateless systems, to great success.
    We are pleased to announce that the Cloud Database Engineering (CDE) team has turned on Chaos Monkey on our Production C* Clusters.

    How did we do it?

    At the heart of being able to introduce Chaos Monkey to C* are 3 things:
    1. Apache Cassandra’s Highly Available architecture.
    2. Reliable Monitoring
    3. Automatic Remediation

    Apache Cassandra’s HA Architecture

    Within the CAP theorem, C*’s shared-nothing data architecture and data replication make it excel at AP. This allows us to “lose” C* nodes without affecting the overall usability of the C* Cluster.

    Reliable Monitoring

    The CDE team has gone to great lengths to understand the inner workings of C* and how to expose metrics and detect the state of our clusters. We've used this knowledge to build reliable monitoring that can help us determine the real-time state of our C* Clusters. It can also distinguish between a transient AWS network issue and a genuinely lost node that needs to be handled.

    Automatic Remediation

    One of the core values of Netflix is that all developers are responsible for operating their code. Since my developers and I are by nature lazy and like to sleep at night, we've developed automation around handling some of the most common states our C* Clusters face.  One of those states is nodes being down. Our automatic remediation system will initiate a node replacement. Once the node has finished bootstrapping the data, the C* Cluster will once again be at full strength.
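
    As an illustration only (this is not our actual tooling), the remediation loop can be thought of along these lines: a node is treated as genuinely lost only after it has been down longer than a grace period, at which point a replacement is kicked off. The ClusterMonitor and NodeReplacer interfaces below are hypothetical stand-ins for the real monitoring and provisioning systems.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Illustrative remediation loop; the interfaces below are made-up stand-ins. */
    class CassandraRemediator {
        private final ClusterMonitor monitor;
        private final NodeReplacer replacer;
        private final Duration grace = Duration.ofMinutes(15);   // ignore transient blips
        private final Map<String, Instant> firstSeenDown = new HashMap<>();

        CassandraRemediator(ClusterMonitor monitor, NodeReplacer replacer) {
            this.monitor = monitor;
            this.replacer = replacer;
        }

        /** Run periodically against each production cluster. */
        void checkCluster(String cluster) {
            for (String node : monitor.downNodes(cluster)) {
                Instant since = firstSeenDown.computeIfAbsent(node, n -> Instant.now());
                // Only act once the node has been down longer than the grace period,
                // so a transient network issue does not trigger a replacement.
                if (Duration.between(since, Instant.now()).compareTo(grace) > 0) {
                    replacer.replace(cluster, node);   // replacement node bootstraps the data
                    firstSeenDown.remove(node);
                }
            }
            monitor.upNodes(cluster).forEach(firstSeenDown::remove);
        }

        interface ClusterMonitor {
            List<String> downNodes(String cluster);
            List<String> upNodes(String cluster);
        }

        interface NodeReplacer {
            void replace(String cluster, String node);
        }
    }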

    Workflow

    Here is a representation of our workflow:



    Netflix strongly believes that testing failure scenarios in production is the most reliable way to gain confidence in your software stack. If you find creating such automation and monitoring right up your alley, visit jobs.netflix.com to join the CDE team.

    HTML5 Video Playback UI


    by Kristofer Baxter

    In the past we’ve written about HTML5 Video (HTML5 Video in IE11 on Windows 8.1, and HTML5 Video at Netflix) but we haven't spoken much about how we built the player UI. The UI Engineering team here at Netflix has been supporting HTML5 based playback for a little over a year, and now seems like the right time to discuss some of the strategies and techniques we are using to support video playback without a plugin.

    One of our main objectives is to keep Netflix familiar to our members. That means we’re keeping the design of the HTML5 player consistent with our Silverlight experience. Features should be rolled out simultaneously for the two platforms. However, HTML5 users will enter playback faster, can enjoy 1080p content when GPU accelerated, and keep all the functionality they know and love.

    Silverlight UI / HTML5 UI

    In order to achieve a similar look and feel, we needed to recreate a few key elements of the Silverlight UI:

    1. Scale interface to the user's resolution
    2. Minimize Startup time via minimal dependency on data
    3. Ensure High Performance on low end hardware

    Scaling interface to the user's resolution

    No matter what resolution the browser window used for playback is, our current playback UI ensures all of the controls maintain the same percentage size on screen. This lets users choose their own dimensions for playing content without the UI getting in the way.

    Normally, a modern web application could implement this using CSS vw and vh units. However, we found this approach to be inadequate for our needs. Our player can be displayed in two fashions -- taking over the entire initial containing block of a viewport, or a smaller portion. To solve for this, we implemented a sizing scheme based entirely on font-relative lengths.

    In this small example, you can see the scaling implementation in a direct form.

    <style>
      .netflix-player-wrapper {
        font-size: 16px;
      }
      #netflix-player {
        position: absolute;
        width: 90%; height: 90%;
        left: 5%; top: 5%;
        overflow: hidden;
        background: #ccc;
        font-size: 1em;
      }
      #player-sizing {
        position: absolute;
        width: 1em; height: 1em;
        visibility: hidden;
        font-size: 1em;
      }
      #ten-percent-height {
        position: absolute;
        width: 80%; height: 10em;
        left: 10%; bottom: 8em;
        background: #000;
        display: flex;
      }
      #ten-percent-height > p {
        display: block;
        margin: 1em;
        font-size: 2em;
        color: #fff;
      }
    </style>
    <div class="netflix-player-wrapper">
      <div id="netflix-player">
        <div id="player-sizing"></div>
        <div id="ten-percent-height"><p>Text</p></div>
      </div>
    </div>
    <script>
    (function () {
      var sizingEl = document.getElementById("player-sizing"),
          controlWrapperEl = document.getElementById('netflix-player'),
          currentEmSize = 1.0;

      function resize() {
        var wrapperHeight = controlWrapperEl.getBoundingClientRect().height,
            sizingHeight = sizingEl.getBoundingClientRect().height,
            wrapperOnePercentHeight = wrapperHeight / 100,
            offsetSize;

        if (sizingHeight > wrapperOnePercentHeight) {
          offsetSize = sizingHeight / wrapperOnePercentHeight;
          currentEmSize = currentEmSize / offsetSize;
        } else if (wrapperOnePercentHeight > sizingHeight) {
          offsetSize = wrapperOnePercentHeight / sizingHeight;
          currentEmSize = currentEmSize * offsetSize;
        }
        controlWrapperEl.style.fontSize = currentEmSize + "em";
      }

      window.addEventListener("resize", resize, false);
      resize();
    })();
    </script>

    We implement this resizing functionality on a debounced interval in the player UI. Triggering it on every window resize would be wasteful.

    By making an em unit represent 1% height of the "netflix-player" container, we can size all of our onscreen elements in a scaling manner - no matter how or where the netflix-player container is placed in the document.

    Minimize Startup time via minimal dependency on data

    Browser plugins like Flash and Silverlight can take several seconds to initialize, especially on a freshly booted machine. Now that we no longer need to initialize a plugin to play content, we can begin playback faster. However, we learned a lot about quick video startup in Silverlight, and can borrow techniques we developed to make our HTML5 UI launch content even faster.

    When possible, allow playback to begin without title metadata.

    If we already know which title the customer has selected to play (like a specific episode or movie), we can just start playback of that title immediately. Once the user has begun to buffer content, the UI can request display metadata. Metadata for the player can be a large payload since it includes episode data (title, synopsis, predicted rating), and is personalized to the user. By delaying the retrieval of metadata, users begin streaming 500 to 1200ms sooner in real-world usage.

    For other conditions, such as when a customer clicks play on a TV show and we want to start playback at the last episode that they were watching, we retrieve the specific episode the user wants before starting the playback process.

    Populate controls which depend on rich data as that data becomes available.

    Since we can begin playback before the player UI knows anything except which title to play, the player UI needs to be resilient against missing metadata. We display a minimal number of controls while this data is being requested. These controls include play/pause, exit playback, and full-screen toggling.

    We use an eventing framework to let individual components know when data state has changed, so each component can stay decoupled. Here’s an example showing how we handle an event telling us the metadata is now loaded for the title.

    function populateStatus() {
      if (Metadata.videoIsKnown(ObjectPool.videoId())) {
        // Update Status to reflect current playing item.
      } else {
        // Hide or remove current status
      }
    }

    Metadata.addEventListener(Metadata.knownEvents.METADATA_LOADED, populateStatus);

    Ensure High Performance on all hardware

    Not everyone has the latest and greatest hardware at their disposal, but that shouldn't prevent those devices from playing content on Netflix. To this end, we develop using a wide variety of hardware and test using a wide range of representative devices.

    We’ve found the issues preventing great performance on low end hardware can mostly be avoided by adhering to the following best practices:

    Avoid repaints and reflows whenever possible.

    Reflows and repaints while playing content are quite costly to overall performance and battery life. As a result, we batch reads and writes to the DOM wherever possible. This helps us avoid accidental reflows.

    Take advantage of getBoundingClientRect to determine the size of an object.

    This is a very fast way to get the dimensions of an object. However, it isn’t a free operation and results should be cached whenever possible.

    Caching the size of objects when dragging, instead of recalculating them every time they are needed, is one such way to reduce the number of calls in quick succession.

    function setupPointerData(e) {
      pointerEventData.dimensions = {
        handleEl: handleEl.getBoundingClientRect(),
        wrapperEl: wrapperEl.getBoundingClientRect()
      };
      pointerEventData.drag = {
        start: { value: currentValue, max: currentMax },
        pointer: { x: e.pageX, y: e.pageY }
      };
    }

    function pointerDownHandler(e) {
      if (handleEl.contains(e.target)) {
        if (!dragging) {
          setupPointerData(e);
          dragging = true;
        }
      }
    }

    function pointerMoveHandler(e) {
      if (dragging && isValidEventLocation(e)) {
        if (!pointerEventData || !pointerEventData.dimensions) {
          setupPointerData(e);
        }
        // Use the handleEl dimensions, wrapperEl dimensions,
        // and the event values to change the DOM.
      }
    }

    We have a lot of work planned

    We’re working on exciting new features and constantly improving our HTML5 Video UI, and we’re looking for help. Our growing team is looking for experts to join us. If you’d like to apply, take a look here.

    Scryer: Netflix’s Predictive Auto Scaling Engine

    To deliver the best possible experience to Netflix customers around the world, it is critical for us to maintain a robust, scalable, and resilient system. That is why we have built (and open sourced) applications ranging from Hystrix to Chaos Monkey. All of these tools better enable us to prevent or minimize outages, respond effectively to outages, and/or anticipate the kinds of operational gaps that may eventually result in outages. Recently we have built another such tool that has been helping us in this ongoing challenge: Scryer.
    Scryer is a new system that allows us to provision the right number of AWS instances needed to handle the traffic of our customers. But Scryer is different from Amazon Auto Scaling (AAS), which reacts to real-time metrics and adjusts instance counts accordingly. Rather, Scryer predicts what the needs will be prior to the time of need and provisions the instances based on those predictions.
    This post is the first in a series that will provide greater details on what Scryer is, how it works, how it differs from the Amazon Auto Scaling, and how we employ it at Netflix.

    Amazon Auto Scaling and the Netflix Use Case

    At the core, AAS is a reactive auto scaling model. That is, AAS dynamically adjusts server counts based on a cluster’s current workload (most often the metric of choice will be something like load average). When spiking or dropping beyond a certain point, AAS policies will trigger the addition or removal of instances. For Netflix, this has proven to be quite effective at improving system availability, optimizing costs, and in some cases reducing latencies.  Overall, AAS is a big win and companies with any kind of scale in AWS should be employing this service.  
    For Netflix, however, there are a range of use cases that are not fully addressed by AAS. The following are some examples:
    • Rapid spike in demand: Instance startup times range from 10 to 45 minutes. During that time our existing servers are vulnerable, especially if the workload continues to increase.
• Outages: A sudden drop in incoming traffic from an outage is sometimes followed by a retry storm (after the underlying issue has been resolved). A reactive system is vulnerable in such conditions because a drop in workload usually triggers a down-scale event, leaving the system under-provisioned to handle the ensuing retry storm.
• Variable traffic patterns: Different times of the day have different workload characteristics and fleet sizes. Some periods show a rapid increase in workload with a relatively small fleet size (20% of maximum), while other periods show a modest increase with a fleet size at 80% of the maximum, making it difficult to handle such variations in optimal ways.
Some of these issues can be mitigated by scaling up aggressively, but this is often undesirable as it may lead to scale-up/scale-down oscillations. Another option is to always run more servers than required, which is clearly not optimal from a cost perspective.

    Scryer: Our Predictive Auto Scaling Engine

    Scryer was inspired in part by these unaddressed use cases, but its genesis was triggered more by our relatively predictable traffic patterns. The following is an example of five days worth of traffic:
In this chart, there are clear spikes and troughs that line up with consistent patterns and times of day. There will always be spikes and valleys that we cannot predict, and the traffic does evolve over longer periods of time. That said, over any given week or month we have a very good idea of what the traffic will look like, as the basic curves are the same. Moreover, these same five days of the week are likely to show the same patterns the week before and the week after (assuming no outages or special events).
Because of these trends, we believed we could build a set of algorithms to predict our capacity needs ahead of actual demand, rather than relying solely on the reactive model of AAS. The following chart shows the result of that effort: the output of our prediction algorithms aligns very closely with our actual metrics.
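As a rough illustration of the idea (not our actual algorithms, which later posts will describe), a prediction for a given minute of the day can be sketched as the average of the rates observed at that same minute on the same weekday over the preceding weeks, translated into an instance count with some headroom:

import java.util.List;

public class NaivePredictor {

    /**
     * Predict the request rate for one minute of the day as the mean of the rates
     * observed at that minute on the same weekday in previous weeks. Each entry in
     * 'history' is a 1440-element array of requests per second, one per minute.
     */
    static double predictRps(List<double[]> history, int minuteOfDay) {
        double sum = 0.0;
        for (double[] week : history) {
            sum += week[minuteOfDay];
        }
        return sum / history.size();
    }

    /** Translate a predicted rate into a desired instance count, with 20% headroom. */
    static int desiredInstances(double predictedRps, double rpsPerInstance) {
        return (int) Math.ceil((predictedRps * 1.2) / rpsPerInstance);
    }
}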

Once these predictions were optimized, we attached them to the AWS APIs that trigger changes in capacity. The following chart shows that our scheduled scaling action plan closely matches our actual traffic, with each step kept small to achieve the best performance.
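Mechanically, putting a prediction into effect can be as simple as registering scheduled actions against an Auto Scaling group ahead of the predicted need. A minimal sketch using the AWS SDK for Java, with illustrative names and sizes:

import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.PutScheduledUpdateGroupActionRequest;

import java.util.Date;

public class ScheduledScaleUp {
    public static void main(String[] args) {
        AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();

        // Schedule the capacity we predict we will need an hour from now, well
        // before instance startup time (10 to 45 minutes) becomes a problem.
        Date oneHourFromNow = new Date(System.currentTimeMillis() + 60L * 60 * 1000);
        autoScaling.putScheduledUpdateGroupAction(
                new PutScheduledUpdateGroupActionRequest()
                        .withAutoScalingGroupName("api-prod")          // illustrative group name
                        .withScheduledActionName("predicted-capacity-18-00")
                        .withStartTime(oneHourFromNow)
                        .withMinSize(120)                              // illustrative sizes
                        .withDesiredCapacity(120));
    }
}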
    We have been running Scryer in production for a few months. The following is a list of the key benefits that we have seen with it:
    • Improved cluster performance
    • Better service availability
    • Reduced EC2 costs

    Predictive-Reactive Auto Scaling - A Hybrid Approach

    As effective as Scryer has been in predicting and managing our instance counts, the real strength of Scryer is in how it operates in tandem with AAS’s reactive model.  

    If we are able to predict the workload of a cluster in advance, then we can proactively scale the cluster ahead of time to accurately meet workload needs. But there will certainly be cases where Scryer cannot predict our needs, such as an unexpected surge in workload.  In these cases, AAS serves as an excellent safety net for us, adding instances based on those unanticipated, unpredicted needs.
    The two auto scaling systems combined provide a much more robust and efficient solution as they complement each other.


    Conclusion

    Overall, Scryer has been incredibly effective at predicting our metrics and traffic patterns, allowing us to better manage our instance counts and stabilizing our systems. We are still rolling this out to the breadth of services within Netflix and will continue to explore its use cases and optimize the algorithms. So far, though, we are excited about the results and are eager to see how it behaves in different environments and conditions.
    In the coming weeks, we plan to publish several more posts discussing Scryer in greater detail, digging deeper into its features, design, technology and algorithms. We are exploring the possibility of open sourcing Scryer in the future as well. 

    Finally, we work on these kinds of exciting challenges all the time at Netflix.  If you would like to join us in tackling such problems, check out our Jobs site.

    Netflix Open Source Software Cloud Prize Winners

We launched the Netflix Open Source Software Cloud Prize in March 2013, and it got a lot of attention in the press and blogosphere. Six months later we closed the contest, took a good look at the entrants, and picked the best as nominees; a panel of distinguished judges then decided the winners in each category. The final winners were announced during Werner Vogels' keynote at AWS Re:Invent on November 14th, 2013.



    The ten winners all put in a lot of work to earn their prizes. They each won a trip to Las Vegas and a ticket for AWS Re:Invent, a Cloud Monkey trophy, $10,000 prize money from Netflix and $5000 in AWS credits from Amazon. After the keynote was over we all went on stage to get our photo taken with Werner Vogels.







Peter Sankauskas (@pas256) is a software engineer living in Silicon Valley and the founder of Answers for AWS (@Answers4AWS). He specializes in automation, scaling and Amazon Web Services. Peter contributed Ansible playbooks, CloudFormation templates and many pre-built AMIs to make it easier for everyone else to get started with NetflixOSS, and put Asgard and Edda in the AWS Marketplace. He recently started his own company, AnsWerS, to help people who want to move to AWS. Getting started with NetflixOSS is made harder by the fact that there are 35 different projects to figure out, and Peter has created an extremely useful and simple on-ramp for new NetflixOSS users.




    Chris Grzegorczyk (eucaflix, grze, of Goleta California) is Chief Architect and Co-Founder, and Vic Iglesias is Quality and Release Manager at Eucalyptus Systems. Eucalyptus have been using NetflixOSS to provide a proof point for portability of applications from AWS to private clouds based on Eucalyptus. Eucalyptus is open source software for building private and hybrid clouds that are compatible with AWS APIs. Their submission enables NetflixOSS projects to treat Eucalyptus as an additional AWS region and to deploy applications to AWS regions and Eucalyptus datacenters from the same Asgard console.


    In June 2013 they shipped a major update to Eucalyptus that included advanced AWS features such as Autoscale Groups that NetflixOSS depends on. Eucalyptus have demonstrated working code at several Netflix meetups and have really helped promote the NetflixOSS ecosystem.




IBM had previously created a demonstration application called Acme Air for their Websphere tools running on IBM Smartcloud. It was a fairly conventional enterprise architecture application, with a Java front end and a database back end. For their winning prize entry, Andrew Spyker (aspyker, of Raleigh, North Carolina) figured out how to re-implement Acme Air as a cloud native example application using NetflixOSS libraries and component services, running on AWS. He then ran some benchmark stress tests to demonstrate scalability. This was demonstrated at a Netflix Meetup last summer. The Acme Air example application combines several NetflixOSS projects: the Eureka service registry, the Hystrix circuit breaker pattern, the Karyon base server framework, the Ribbon HTTP client, and the Asgard provisioning portal. IBM used NetflixOSS to get a deeper understanding of Cloud Native architecture and tools, which it can apply to helping enterprise customers make the transition to cloud.




The Reactive Extensions (Rx) pattern is one of the most advanced and powerful concepts for structuring code to come out in recent years. The original work on Rx at Microsoft by Erik Meijer inspired Netflix to create the RxJava project. We started with a subset of Rx functionality and left a lot of "to do" areas. As the project matured we began to extend RxJava to include other JVM based languages, and Joachim Hofer (jmhofer, Möhrendorf, Germany) has made major contributions to type safety and Scala support, with over thirty pull requests.
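For readers unfamiliar with the pattern, here is a minimal RxJava sketch (assuming a recent RxJava release and Java 8 lambdas) of composing a stream declaratively:

import rx.Observable;

public class RxSketch {
    public static void main(String[] args) {
        // Emit items, transform and filter them, and react to the results,
        // all as one declarative pipeline.
        Observable.just("Ribbon", "Hystrix", "Karyon")
                .map(name -> name.length())
                .filter(length -> length > 6)
                .subscribe(
                        length -> System.out.println("long name: " + length),
                        error -> System.err.println("failed: " + error));
    }
}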


Joachim works at Imbus AG in Möhrendorf, Germany, where he is the lead developer of an agile product team and a Scala enthusiast working on moving their stack from J2EE to Scala/Play/Akka/Spray/RxJava.



    Anyone familiar with Hadoop tools and the big data ecosystem knows about the Pig language. It provides a way to specify a high level dataflow for processing but the Pig scripts can get complex and hard to debug. Netflix built and open sourced a visualization and monitoring tool called Lipstick, and it was adopted by Mark Roddy at a vendor called Mortar (mortardata, of New York, NY) who worked with us to generalize some of the interfaces and integrate it with their own Pig based Hadoop platform. We saved Mortar from having to create their own tool to do this, and Netflix now has an enthusiastic partner to help to improve and extend Lipstick so everyone who uses it benefits.




Jakub Narloch (jmnarloch, of Szczecin, Poland) created a test suite for NetflixOSS Karyon based on JBoss Arquillian. The extension integrates with Karyon's Google Guice dependency injection, allowing developers to write tests that directly access the application's auto-scanned components. The tests are executed in the application container. Arquillian brings wide support for different containers, including Tomcat, Jetty and JBoss AS. Karyon is the base server that underpins NetflixOSS services and acts as the starting point for developing new services. Since Genie is based on Karyon, we were able to leverage this integration to use Arquillian to test Genie, and the changes have been merged into the code that Netflix uses internally.


Jakub Narloch is a software engineer working at Samsung Electronics. He received the JBoss Community Recognition Award this year for his open source contributions. In the past year he has been actively helping to develop the JBoss Arquillian project, authoring four completely new extensions and helping to shape many others. His adventure with the open source world began a couple of years earlier, and he has also contributed code to projects like Spring Framework, Castor XML and NetflixOSS. Last year he graduated with honors in Computer Science from Warsaw University of Technology with an MSc degree. In the past he took part in two editions of Google Summer of Code, and in his free time he likes to solve the software development contests held by TopCoder Inc.




In the real world, complex web service APIs are hard to manage, and NetflixOSS includes the Zuul API gateway, which is used to authenticate, process, and route HTTP requests. The next winner is Neil Beveridge (neilbeveridge, of Kent, United Kingdom). He was interested in porting the Zuul container from Tomcat to Netty, which also provides non-blocking outbound requests, and benchmarking the difference. Neil ran the benchmarks with help from Raamnath Mani, Fanta Gizaw and Will Tomlin at Hotels.com. They ran into an interesting problem with Netty consuming excess CPU and running slower than the original Tomcat version, and then ran into the contest deadline, but have since continued work to debug and tune the Netty code and come up with higher performance for Netty and some comparisons of cloud and bare-metal performance for Zuul. Since Netflix is also looking at moving some of our services from Tomcat to Netty, this is a useful and timely contribution. It's also helpful to other people considering Zuul to have some published benchmarks showing the throughput on common AWS instance types.



    Although the primary storage used by Netflix is based on Cassandra, we also use AWS RDS to create many small MySQL databases for specific purposes. Other AWS customers use RDS much more heavily. Jiaqi Guo (jiaqi, Chicago, Illinois) has built Datamung to automate backup of RDS to S3 and replication of backups across regions for disaster recovery. Datamung is a web-based, Simple Workflow driven application that backs up RDS MySQL databases into S3 objects by launching an EC2 instance and running the mysqldump command. It makes it possible to replicate RDS across regions, VPC, accounts or outside the AWS network.




    When we started to build the Denominator library for portable DNS management we contacted Neustar to discuss their UltraDNS product, and made contact with Jeff Damick (jdamick, of South Riding, Virginia). His input as we structured the early versions of Denominator was extremely useful, and provides a great example of the power of developing code in public. We were able to tap into his years of experience with DNS management, and he was able to contribute code, tests and fixes to the Denominator code and fixes to the UltraDNS API itself.




Justin Santa Barbara (justinsb of San Francisco, California) decided to make the Chaos Monkey far more evil, and created fourteen new variants, a "barrel of chaos monkeys". They interfere with the network, causing routing failure, packet loss, network data corruption and extra network latency. They block access to DNS, S3, DynamoDB and the EC2 control plane. They interfere with storage, by disconnecting EBS volumes, filling up the root disk, and saturating the disks with IO requests. They interfere with the CPU by consuming all the spare cycles, or killing off all the processes written in Python or Java. When run, a random selection is made, and the victims suffer the consequences. This is an excellent but scary workout for our monitoring and repair/replacement automation.




    We are pleased to have such a wide variety of winners, from individuals around the world, small and large companies, vendors and end users. Many thanks to all of them for the work they have put into helping grow the NetflixOSS ecosystem, and thanks to everyone else who just uses NetflixOSS or entered the contest but didn’t make the final cut.

    The winners, judges and support team got "Cloud Monkey" trophies custom made by Bleep Labs.

    Preparing the Netflix API for Deployment

At Netflix, we are committed to bringing new features and product enhancements to customers rapidly and frequently. Consequently, dozens of engineering teams are constantly innovating on their services, resulting in a rate of change to the overall service that is vast and unending. Because of our appetite for providing such improvements to our members, it is critical for us to maintain a Software Delivery pipeline that allows us to deploy changes to our production environments in an easy, seamless, and quick way, with minimal risk.


    The Challenge
The Netflix API is responsible for distributing the content that supports our features to the UIs. The API, in turn, is dependent on a large number of services that are responsible for generating that data. We often use an hourglass metaphor to describe the Netflix technology stack, with the API being the slender neck.
    In the picture above, the changes in content from the services at the bottom of the hourglass need to be delivered in a rapid fashion to the dozens of UIs at the top. The UI teams are on their own rapid iteration cycles, trying to create or modify user experiences for our customers based on that content. The API is the primary conduit for the flow of this content between the services and UIs. This imposes a unique challenge when it comes to our development, testing and deployment flows: we need to move as fast as or faster than the aggregate of the UIs we serve and the services from which we consume. In order to deliver on that challenge and help us keep up with the demands of the business, we are continually investing in our Software Delivery pipeline, with a view towards Continuous Delivery as our end goal.

    We have already described the technology and automation that powers our global, multi-region deployments in an earlier blog post. In this post, we will delve into the earlier stages of our pipeline, focusing on our approach to generating an artifact that is ready for deployment. Automation and Insight continue to remain key themes through this part of our pipeline just as they are on the deployment side.


    Delivery pipeline 
    The diagram below illustrates the flow of code through our Delivery pipeline. The rest of this post describes the steps denoted by the gray boxes.
Let's take a look at how a commit makes its way through the pipeline. A developer starts by making code changes for a feature or bug fix. First, s/he decides which branch in the source tree to work off:
    • Release - on a Weekly deployment schedule
    • Prod - on a Daily deployment schedule
    There are no hard and fast rules regarding which branch to use; just broad guidelines. Typically, we use the Release branch for changes that meet one of the following criteria:
    • Require additional testing by people outside of the API Team
    • Require dependency library updates
    • Need additional "bake-time" before they can be deployed to Production

    [1] Developers build and run tests in their local environment as part of their normal workflow. They make the determination as to the appropriate level of testing that will give them confidence in their change. They have the capability to wrap their changes with Feature Flags, which can be used to enable or disable code paths at runtime.
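As a minimal sketch of what such a flag can look like, assuming a dynamic configuration mechanism along the lines of NetflixOSS Archaius (this post does not name the exact library), with an illustrative property name:

import com.netflix.config.DynamicBooleanProperty;
import com.netflix.config.DynamicPropertyFactory;

public class RecommendationFeature {

    // The flag can be flipped at runtime without a redeploy.
    private static final DynamicBooleanProperty NEW_RANKER_ENABLED =
            DynamicPropertyFactory.getInstance()
                    .getBooleanProperty("api.feature.newRanker.enabled", false);

    public String rank(String profileId) {
        if (NEW_RANKER_ENABLED.get()) {
            return rankWithNewAlgorithm(profileId);      // new code path under test
        }
        return rankWithCurrentAlgorithm(profileId);      // existing, known-good path
    }

    // Hypothetical implementations, stubbed so the sketch stands on its own.
    private String rankWithNewAlgorithm(String profileId) { return "new:" + profileId; }
    private String rankWithCurrentAlgorithm(String profileId) { return "old:" + profileId; }
}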

    [2] Once a change is committed to the repository, it is part of the Delivery pipeline. 

[3][4] Every commit triggers a Build/Test/Deploy cycle on our Continuous Integration server. If the build passes a series of gates, it is deployed to one of our Pre-Production environments [5], depending on the branch the build was generated from. We maintain several environments to satisfy the varied requirements of our user base, which consists of Netflix internal Dev and Test teams as well as external partners. With this process, we ensure that every code change is available for use by our partners within roughly an hour of being committed to the repository.

[6] If the change was submitted to the Prod branch or a manual push trigger was invoked, the build is deployed to the canary cluster as well. It is worth mentioning that our unit of deployment is actually an Amazon Machine Image (AMI), which is generated from a successful build. For the purposes of this post, though, we'll stick to the term "build".

    Dependency Management
This is a good time to take a deeper look at our approach to dependency management. In the Netflix SOA implementation, services expose their interfaces to consumers through client libraries that define the contract between the two services. Those client libraries get incorporated into the consuming service, introducing build and runtime dependencies for that service. This is highly magnified in the case of the API because of its position in the middle of the hourglass, as depicted at the beginning of the post. With dozens of dependencies and hundreds of client libraries, getting a fully functional version of the API server stood up can in itself be tricky. To add to that, at any point in time, several of these libraries are being updated to support critical new functionality or bug fixes. In order to keep up with the volume and rate of change, we have a separate instance of our pipeline dedicated to library integration. This integration pipeline builds an API server with the latest releases of all client libraries and runs a suite of functional tests against it. If that succeeds, it generates a configuration file with hardcoded library versions and commits it into SCM. Our developers and the primary build pipeline use this locked set of libraries, which allows us to insulate the team from instabilities introduced by client library changes while still keeping up with library updates.

    Testing
    Next, we explore our test strategy in detail. This is covered in steps 1-4 in the pipeline diagram above. It is helpful to look at our testing in terms of the Software Testing Pyramid to get a sense for our multi-layered approach.

    Local Development/Test iterations

Developers have the ability to run unit and functional tests against a local server to validate their changes. Tests can be run at a much more granular level than on the CI server, so as to keep run times short and support iterative development.

    Continuous Integration
    Once code has been checked in, the CI servers take over. The goal is to assess the health of the branch in the SCM repository with every commit, including automated code merges between various branches as well as dependency updates. The inability to generate a deployable artifact or failed unit tests cause the build to fail and block the pipeline.

    Contract validations
    Our interface to our consumers is via a Java API that uses the Functional Reactive model implemented by RxJava. The API is consumed by Groovy scripts that our UI teams develop. These scripts are compiled dynamically at runtime within an API server, which is when an incompatible change to the API would be detected. Our build process includes a validation step based on API signature checks to catch such breakages earlier in the cycle. Any non-additive changes to the API as compared to the version that’s currently in Production fail the build and halt the pipeline.
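The post does not spell out how the signature check is implemented; conceptually, an additive-only validation over the public API surface can be sketched with reflection, roughly as follows (class and method names are illustrative):

import java.lang.reflect.Method;
import java.util.HashSet;
import java.util.Set;

public class ApiContractCheck {

    /** Collect "methodName(paramTypes):returnType" strings for a class's public methods. */
    static Set<String> publicSignatures(Class<?> api) {
        Set<String> signatures = new HashSet<>();
        for (Method m : api.getMethods()) {
            StringBuilder sig = new StringBuilder(m.getName()).append("(");
            for (Class<?> p : m.getParameterTypes()) {
                sig.append(p.getName()).append(",");
            }
            sig.append("):").append(m.getReturnType().getName());
            signatures.add(sig.toString());
        }
        return signatures;
    }

    /** The build would fail if any signature in the Production API is missing from the candidate. */
    static boolean isAdditiveOnly(Class<?> productionApi, Class<?> candidateApi) {
        return publicSignatures(candidateApi).containsAll(publicSignatures(productionApi));
    }
}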

    Integration Tests
    A build that has passed unit and contract validation tests gets deployed to the CI server where we test it for readiness to serve traffic. After the server is up and running with the latest build, we run functional tests against it. Given our position in the Netflix stack, any test against the API is by definition an integration test. Running integration tests reliably in a live, cloud-based, distributed environment poses some interesting challenges. For instance, each test request results in multiple network calls on the API server, which implies variability in response times. This makes tests susceptible to timing related failures. Further, the API is designed to degrade gracefully by providing partial or fallback responses in case of a failure or latency in a backend service. So responses are not guaranteed to be static. Several backend services provide personalized data which is dynamic in nature. Factors such as these contribute towards making tests non-deterministic, which can quickly diminish the value of our test suites.

    We follow a few guiding principles and techniques to keep the non-determinism under control.
    • Focus on the structure rather than the content of the responses from the API. We leave content verification to the services that are responsible for generating the data. Tests are designed with this in mind.
    • Account for cold caches and connections by priming them prior to test execution.
    • Design the system to provide feedback when it is returning fallback responses or otherwise running in a degraded mode so tests can leverage that information.
• Quarantine unreliable tests promptly and automate the process of associating them with tasks in our bug tracking system.
    These techniques have allowed us to keep the noise in the test results down to manageable levels. A test failure in our primary suite blocks the build from getting deployed anywhere.  
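As an illustration of the first principle above, a structure-focused check might look something like this JUnit sketch; the response shape and the stubbed fetch are hypothetical stand-ins for a call against a freshly deployed server:

import static org.junit.Assert.assertTrue;

import java.util.List;
import java.util.Map;
import org.junit.Test;

public class ResponseStructureTest {

    // Stubbed here so the sketch compiles on its own; the real suite would call
    // the freshly deployed API server.
    private Map<String, Object> fetchHomePageResponse() {
        return Map.of("videos", List.of(Map.of("id", 1, "title", "stub")));
    }

    @Test
    public void homePageResponseHasExpectedShape() {
        Map<String, Object> response = fetchHomePageResponse();

        // Assert on structure, not on the personalized, non-deterministic content.
        assertTrue(response.containsKey("videos"));
        for (Object entry : (List<?>) response.get("videos")) {
            Map<?, ?> video = (Map<?, ?>) entry;
            assertTrue(video.containsKey("id"));
            assertTrue(video.containsKey("title"));
        }
    }
}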
    Another key check we perform at this stage is related to server startup time, which we track across builds. If this metric crosses a certain threshold, the pipeline is halted.

    User Acceptance Tests
    If a build has made it to this point, it is ready to be deployed to one or more internal environments for user-acceptance testing. Users could be UI developers implementing a new feature using the API, UI Testers performing end-to-end testing or automated UI regression tests. As far as possible, we strive to not have user-acceptance tests be a gating factor for our deployments. We do this by wrapping functionality in Feature Flags so that it is turned off in Production while testing is happening in other environments. 

    Canary Analysis
    This is the final step in the process before the code is deployed globally. A build that has reached this step is a candidate for Production deployment from a functional standpoint. The canary process analyzes the readiness of the build from a systems perspective. It is described in detail in the post on Deploying the API.

    Insight - tracking changes
    Continuous Delivery is all about feedback loops. As a change makes its way through the pipeline, the automation delivers the results of each step to the concerned person(s) in real time via email and/or chat notifications. Build and Test results are also captured and made available on our Build Status Dashboard for later viewing. Everyone committing a change to the Release or Prod branch is responsible for ensuring that the branch is in a deployable state after their change. The Build Dash always displays the latest status of each branch, including a history of recent commits and test runs. Members of our Delivery Team are on a weekly rotation to ensure smooth operation of the pipelines.



    Conclusion
The work we've done so far has been helpful in increasing our agility and our ability to scale with the growth of the business. But as Netflix grows and our systems evolve, there are newer challenges to overcome. We are experimenting in different areas like developer productivity, improved insight into code usage, and optimizing the pipeline for speed. If any of these sound exciting, take a look at our open roles.
