
Netflix at AWS re:Invent 2016

Like many of our tech blog readers, Netflix is getting ready for AWS re:Invent in Las Vegas next week. Lots of Netflix engineers and recruiters will be in attendance, and we're looking forward to meeting and reconnecting with cloud enthusiasts and Netflix OSS users.

To make it a little easier to find our speakers at re:Invent, we're posting the schedule of Netflix talks here. We'll also have a booth on the expo floor and hope to see you there!

AES119 - Boosting Your Developers Productivity and Efficiency by Moving to a Container-Based Environment with Amazon ECS
Wednesday, November 30, 3:40pm (Executive Summit)
Neil Hunt, Chief Product Officer

Abstract: Increasing productivity and encouraging more efficient ways for development teams to work is top of mind for nearly every IT leader. In this session, Neil Hunt, Chief Product Officer at Netflix, will discuss why the company decided to introduce a container-based approach in order to speed development time, improve resource utilization, and simplify the developer experience. Learn about the company’s technical and business goals, technology choices and tradeoffs it had to make, and benefits of using Amazon ECS.

ARC204 - From Resilience to Ubiquity - #NetflixEverywhere Global Architecture
Tuesday, November 29, 9:30am
Thursday, December 1, 12:30pm
Coburn Watson, Director, Performance and Reliability

Abstract: Building and evolving a pervasive, global service requires a multi-disciplined approach that balances requirements with service availability, latency, data replication, compute capacity, and efficiency. In this session, we’ll follow the Netflix journey of failure, innovation, and ubiquity. We'll review the many facets of globalization and then delve deep into the architectural patterns that enable seamless, multi-region traffic management; reliable, fast data propagation; and efficient service infrastructure. The patterns presented will be broadly applicable to internet services with global aspirations.

BDM306 - Netflix: Using Amazon S3 as the fabric of our big data ecosystem
Tuesday, November 29, 5:30pm
Wednesday, November 30, 12:30pm
Eva Tse, Director, Big Data Platform
Kurt Brown, Director, Data Platform

Abstract: Amazon S3 is the central data hub for Netflix's big data ecosystem. We currently have over 1.5 billion objects and 60+ PB of data stored in S3. As we ingest, transform, transport, and visualize data, we find this data naturally weaving in and out of S3. Amazon S3 provides us the flexibility to use an interoperable set of big data processing tools like Spark, Presto, Hive, and Pig. It serves as the hub for transporting data to additional data stores / engines like Teradata, Redshift, and Druid, as well as exporting data to reporting tools like Microstrategy and Tableau. Over time, we have built an ecosystem of services and tools to manage our data on S3. We have a federated metadata catalog service that keeps track of all our data. We have a set of data lifecycle management tools that expire data based on business rules and compliance. We also have a portal that allows users to see the cost and size of their data footprint. In this talk, we’ll dive into these major uses of S3, as well as many smaller cases, where S3 smoothly addresses an important data infrastructure need. We will also provide solutions and methodologies on how you can build your own S3 big data hub.

CON313 - Netflix: Container Scheduling, Execution, and Integration with AWS
Thursday, December 1, 2:00pm
Andrew Spyker, Manager, Netflix Container Cloud

Abstract: Members from all over the world streamed over forty-two billion hours of Netflix content last year. Various Netflix batch jobs and an increasing number of service applications use containers for their processing. In this session, Netflix presents a deep dive on the motivations and the technology powering container deployment on top of Amazon Web Services. The session covers our approach to resource management and scheduling with the open source Fenzo library, along with details of how we integrate Docker and Netflix container scheduling running on AWS. We cover the approach we have taken to deliver AWS platform features to containers such as IAM roles, VPCs, security groups, metadata proxies, and user data. We want to take advantage of native AWS container resource management using Amazon ECS to reduce operational responsibilities. We are delivering these integrations in collaboration with the Amazon ECS engineering team. The session also shares some of the results so far, and lessons learned throughout our implementation and operations.

DEV209 - Another Day in the Life of a Netflix Engineer
Wednesday, November 30, 4:00pm
Dave Hahn, Senior SRE

Abstract: Netflix is big. Really big. You just won't believe how vastly, hugely, mind-bogglingly big it is. Netflix is a large, ever-changing ecosystem serving millions of customers across the globe through cloud-based systems and a globally distributed CDN. This entertaining romp through the tech stack serves as an introduction to how we think about and design systems, the Netflix approach to operational challenges, and how other organizations can apply our thought processes and technologies. We’ll talk about: 
  • The Bits - The technologies used to run a global streaming company 
  • Making the Bits Bigger - Scaling at scale 
  • Keeping an Eye Out - Billions of metrics 
  • Break all the Things - Chaos in production is key 
  • DevOps - How culture affects your velocity and uptime

DEV311 - Multi-Region Delivery Netflix Style
Thursday, December 1, 1:00pm
Andrew Glover, Engineering Manager

Abstract: Netflix rapidly deploys services across multiple AWS accounts and regions over 4,000 times a day. We’ve learned many lessons about reliability and efficiency. What’s more, we’ve built sophisticated tooling to facilitate our growing global footprint. In this session, you’ll learn about how Netflix confidently delivers services on a global scale and how, using best practices combined with freely available open source software, you can do the same.

MBL204 - How Netflix Achieves Email Delivery at Global Scale with Amazon SES
Tuesday, November 29, 10:00am
Devika Chawla, Engineering Director

Abstract: Companies around the world are using Amazon Simple Email Service (Amazon SES) to send millions of emails to their customers every day, and scaling linearly, at cost. In this session, you learn how to use the scalable and reliable infrastructure of Amazon SES. In addition, Netflix talks about their advanced Messaging program, their challenges, how SES helped them with their goals, and how they architected their solution for global scale and deliverability.

NET304 - Moving Mountains: Netflix's Migration into VPC
Thursday, December 1, 3:30pm
Andrew Braham, Manager, Cloud Network Engineering
Laurie Ferioli, Senior Program Manager

Abstract: Netflix was one of the earliest large-scale AWS customers. By 2014, we were running hundreds of applications in Amazon EC2. That was great, until we needed to move to VPC. Given our scale, uptime requirements, and the decentralized nature of how we manage our production environment, the VPC migration (still ongoing) presented particular challenges for us and for AWS as it sought to support our move. In this talk, we discuss the starting state, our requirements and the operating principles we developed for how we wanted to drive the migration, some of the issues we ran into, and how the tight partnership with AWS helped us migrate from an EC2-Classic platform to an EC2-VPC platform.

SAC307 - The Psychology of Security Automation
Thursday, December 1, 12:30pm
Friday, December 2, 11:00am
Jason Chan, Director, Cloud Security

Abstract: Historically, relationships between developers and security teams have been challenging. Security teams sometimes see developers as careless and ignorant of risk, while developers might see security teams as dogmatic barriers to productivity. Can technologies and approaches such as the cloud, APIs, and automation lead to happier developers and more secure systems? Netflix has had success pursuing this approach, by leaning into the fundamental cloud concept of self-service, the Netflix cultural value of transparency in decision making, and the engineering efficiency principle of facilitating a “paved road.”

This session explores how security teams can use thoughtful tools and automation to improve relationships with development teams while creating a more secure and manageable environment. Topics include Netflix’s approach to IAM entity management, Elastic Load Balancing and certificate management, and general security configuration monitoring.

STG306 - Tableau Rules of Engagement in the Cloud
Thursday, December 1, 1:00pm
Srikanth Devidi, Senior Data Engineer
Albert Wong, Enterprise Reporting Platform Manager

Abstract: You have billions of events in your fact table, all of it waiting to be visualized. Enter Tableau… but wait: how can you ensure scalability and speed with your data in Amazon S3, Spark, Amazon Redshift, or Presto? In this talk, you’ll hear how Albert Wong and Srikanth Devidi at Netflix use Tableau on top of their big data stack. Albert and Srikanth also show how you can get the most out of a massive dataset using Tableau, and help guide you through the problems you may encounter along the way. 


More Efficient Mobile Encodes for Netflix Downloads


Last January, Netflix launched globally, reaching many new members in 130 countries around the world. In many of these countries, people access the internet primarily using cellular networks or still-developing broadband infrastructure. Although we have made strides in delivering the same or better video quality with fewer bits (for example, with per-title encode optimization), further innovation is required to improve video quality over low-bandwidth, unreliable networks. In this blog post, we summarize our recent work on generating more efficient video encodes, especially targeted towards low-bandwidth Internet connections. We refer to these new bitstreams as our mobile encodes.

Our first use case for these streams is the recently launched downloads feature on Android and iOS.

What’s new about our mobile encodes

We are introducing two new types of mobile encodes - AVCHi-Mobile and VP9-Mobile. The enhancements in the new bitstreams fall into three categories: (1) new video compression formats, (2) more optimal encoder settings, and (3) per-chunk bitrate optimization. All the changes combined result in better video quality for the same bitrate compared to our current streams (AVCMain).

New compression formats

Many Netflix-ready devices receive streams which are encoded using the H.264/AVC Main profile (AVCMain). This is a widely-used video compression format, with ubiquitous decoder support on web browsers, TVs, mobile devices, and other consumer devices. However, newer formats are available that offer more sophisticated video coding tools. For our mobile bitstreams we adopt two compression formats: H.264/AVC High profile and VP9 (profile 0). Similar to Main profile, the High profile of H.264/AVC enjoys broad decoder support. VP9, a royalty-free format developed by Google, is supported on the majority of Android devices, Chrome, and a growing number of consumer devices.

The High profile of H.264/AVC shares the general architecture of the Main profile and, among other features, offers additional tools that increase compression efficiency. The tools from High profile that are relevant to our use case are:
  • 8x8 transforms and Intra 8x8 prediction
  • Quantization scaling matrices
  • Separate Cb and Cr control

VP9 has a number of tools which bring improvements in compression efficiency over H.264/AVC, including:
  • Motion-predicted blocks of sizes up to 64×64
  • 1/8th pel motion vectors
  • Three switchable 8-tap subpixel interpolation filters
  • Better coding of motion vectors
  • Larger discrete cosine transforms (DCT, 16×16, and 32×32)
  • Asymmetric discrete sine transform (ADST)
  • Loop filtering adapted to new block sizes
  • Segmentation maps

More optimal encoder settings

Apart from using new coding formats, optimizing encoder settings allows us to further improve compression efficiency. Examples of improved encoder settings are as follows:
  • Increased random access picture period: This parameter trades off encoding efficiency with granularity of random access points.
  • More consecutive B-frames or longer Alt-ref distance: Allowing the encoder to flexibly choose more B-frames in H.264/AVC or longer distance between Alt-ref frames in VP9 can be beneficial, especially for slowly changing scenes.
  • Larger motion search range: Results in better motion prediction and fewer intra-coded blocks.
  • More exhaustive mode evaluation: Allows an encoder to evaluate more encoding options at the expense of compute time.

Per-chunk encode optimization

In our parallel encoding pipeline, the video source is split up into a number of chunks, each of which is processed and encoded independently. For our AVCMain encodes, we analyze the video source complexity to select bitrates and resolutions optimized for that title. Whereas our AVCMain encodes use the same average bitrate for each chunk in a title, the mobile encodes optimize the bitrate for each individual chunk based on its complexity (in terms of motion, detail, film grain, texture, etc). This reduces quality fluctuations between the chunks and avoids over-allocating bits to chunks with less complex content.
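
As a toy illustration of the idea, here is a minimal Java sketch of distributing a title's average bitrate across chunks in proportion to their complexity. The complexity scores and the proportional allocation rule are assumptions for illustration only, not our actual rate-quality models.

class PerChunkAllocation {

    // Distribute a total bit budget across chunks in proportion to complexity,
    // keeping the title's average bitrate unchanged.
    static double[] allocateBitrates(double[] chunkComplexity, double avgBitrateKbps) {
        double totalComplexity = 0;
        for (double c : chunkComplexity) totalComplexity += c;

        double[] perChunkKbps = new double[chunkComplexity.length];
        for (int i = 0; i < chunkComplexity.length; i++) {
            // Simple chunks get fewer bits, complex chunks get more.
            perChunkKbps[i] = avgBitrateKbps * chunkComplexity.length
                    * (chunkComplexity[i] / totalComplexity);
        }
        return perChunkKbps;
    }

    public static void main(String[] args) {
        double[] complexity = {0.4, 1.0, 2.1, 0.7};        // hypothetical per-chunk scores
        double[] kbps = allocateBitrates(complexity, 800);  // 800 kbps title average
        for (double k : kbps) System.out.printf("%.0f kbps%n", k);
    }
}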

Video compression results

In this section, we evaluate the compression performance of our new mobile encodes. The following configurations are compared:
  • AVCMain: Our existing H.264/AVC Main profile encodes, using per-title optimization, serve as the anchor for the comparison.
  • AVCHi-Mobile: H.264/AVC High profile encodes using more optimal encoder settings and per-chunk encoding.  
  • VP9-Mobile: VP9 encodes using more optimal encoder settings and per-chunk encoding.

The results were obtained on a sample of 600 full-length popular movies or TV episodes with 1080p source resolution (which adds up to about 85 million frames). We encode multiple quality points (with different resolutions), to account for different bandwidth conditions of our members.

In our tests, we calculate PSNR and VMAF to measure video quality.  The metrics are  computed after scaling the decoded videos to the original 1080p source resolution. To compare the average compression efficiency improvement, we use Bjontegaard-delta rate (BD-rate), a measure widely used in video compression. BD-rate indicates the average change in bitrate that is needed for a tested configuration to achieve the same quality as the anchor. The metric is calculated over a range of bitrate-quality points and interpolates between them to get an estimate of the relative performance of two configurations.

The graph below illustrates the results of the comparison. The bars represent BD-rate gains, and higher percentages indicate larger bitrate savings. The AVCHi-Mobile streams can deliver the same video quality at 15% lower bitrate according to PSNR and at 19% lower bitrate according to VMAF. The VP9-Mobile streams show more gains and can deliver an average of 36% bitrate savings according to PSNR and VMAF. This demonstrates that using the new mobile encodes requires significantly less bitrate for the same quality.



Viewing it another way, members can now receive better quality streams for the same bitrate. This is especially relevant for members with slow or expensive internet connectivity. The graph below illustrates the average quality (in terms of VMAF) at different available bit budgets for the video bitstream. For example, at 1 Mbps, our AVCHi-Mobile and VP9-Mobile streams show an average VMAF increase of 7 and 10, respectively, over AVC-Main. These gains represent noticeably better visual quality for the mobile streams.


How can I watch with the new mobile encodes?

Last month, we started re-encoding our catalog to generate the new mobile bitstreams and the effort is ongoing. The mobile encodes are being used in the brand new downloads feature. In the near future, we will also use these new bitstreams for mobile streaming to broaden the benefit for Netflix members, no matter how they’re watching.


By Andrey Norkin, Jan De Cock, Aditya Mavlankar and Anne Aaron

NetflixOSS: Announcing Hollow

“If you can cache everything in a very efficient way,
you can often change the game”

We software engineers often face problems that require the dissemination of a dataset which doesn’t fit the label “big data”.  Examples of this type of problem include:

  • Product metadata on an ecommerce site
  • Document metadata in a search engine
  • Metadata about movies and TV shows on an Internet television network

When faced with these we usually opt for one of two paths:

  • Keep the data in a centralized location (e.g. an RDBMS, nosql data store, or memcached cluster) for remote access by consumers
  • Serialize it (e.g. as JSON, XML, etc.) and disseminate it to consumers which keep a local copy

Scaling each of these paths presents different challenges. Centralizing the data may allow your dataset to grow indefinitely large, but:

  • There are latency and bandwidth limitations when interacting with the data
  • A remote data store is never quite as reliable as a local copy of the data

On the other hand, serializing and keeping a local copy of the data entirely in RAM can allow many orders of magnitude lower latency and higher frequency access, but this approach has scaling challenges that get more difficult as a dataset grows in size:

  • The heap footprint of the dataset grows
  • Retrieving the dataset requires downloading more bits
  • Updating the dataset may require significant CPU resources or impact GC behavior

Engineers often select a hybrid approach — cache the frequently accessed data locally and go remote for the “long-tail” data.  This approach has its own challenges:

  • Bookkeeping data structures can consume a significant amount of the cache heap footprint
  • Objects are often kept around just long enough for them to be promoted and negatively impact GC behavior

At Netflix we’ve realized that this hybrid approach often represents a false savings.  Sizing a local cache is often a careful balance between the latency of going remote for many records and the heap requirement of keeping more data local.  However, if you can cache everything in a very efficient way, you can often change the game — and get your entire dataset in memory using less heap and CPU than you would otherwise require to keep just a fraction of it.  This is where Hollow, Netflix’s latest OSS project comes in.




Hollow is a Java library and comprehensive toolset for harnessing small to moderately sized in-memory datasets which are disseminated from a single producer to many consumers for read-only access.

“Hollow shifts the scale:
datasets for which such liberation
may never previously have been considered
can be candidates for Hollow.”


Performance

Hollow focuses narrowly on its prescribed problem set: keeping an entire, read-only dataset in-memory on consumers.  It circumvents the consequences of updating and evicting data from a partial cache.

Due to its performance characteristics, Hollow shifts the scale in terms of appropriate dataset sizes for an in-memory solution. Datasets for which such liberation may never previously have been considered can be candidates for Hollow.  For example, Hollow may be entirely appropriate for datasets which, if represented with JSON or XML, might require in excess of 100GB.

Agility

Hollow does more than simply improve performance; it also greatly enhances teams’ agility when dealing with data-related tasks.

Right from the initial experience, using Hollow is easy.  Hollow will automatically generate a custom API based on a specific data model, so that consumers can intuitively interact with the data, with the benefit of IDE code completion.
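
To make the generated-API idea concrete, here is a hand-written Java sketch of the kind of typed access a generated API provides. The names (Movie, MovieAPI, getAllMovie) are illustrative assumptions based on an example data model, not the code Hollow actually generates.

import java.util.Arrays;
import java.util.List;

class Movie {
    final long id;
    final String title;
    Movie(long id, String title) { this.id = id; this.title = title; }
}

interface MovieAPI {
    List<Movie> getAllMovie();   // a generated API exposes typed collection accessors
}

class GeneratedApiSketch {
    static void printStrangerTitles(MovieAPI api) {
        for (Movie movie : api.getAllMovie()) {          // typed access, IDE completion friendly
            if (movie.title.startsWith("Stranger")) {
                System.out.println(movie.id + ": " + movie.title);
            }
        }
    }

    public static void main(String[] args) {
        // A tiny in-memory stand-in for a dataset delivered by a producer.
        MovieAPI api = () -> Arrays.asList(new Movie(1, "Stranger Things"),
                                           new Movie(2, "The Crown"));
        printStrangerTitles(api);
    }
}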

But the real advantages come from using Hollow on an ongoing basis.  Once your data is Hollow, it has more potential.  Imagine being able to quickly shunt your entire production dataset, current or from any point in the recent past, down to a local development workstation, load it, then exactly reproduce specific production scenarios.

Choosing Hollow will give you a head start on tooling; Hollow comes with a variety of ready-made utilities to provide insight into and manipulate your datasets.

Stability

How many nines of reliability are you after?  Three, four, five?  Nine?  As a local in-memory data store, Hollow isn’t susceptible to environmental issues, including network outages, disk failures, noisy neighbors in a centralized data store, etc.  If your data producer goes down or your consumer fails to connect to the data store, you may be operating with stale data but the data is still present and your service is still up.

Hollow has been battle-hardened over more than two years of continuous use at Netflix. We use it to represent crucial datasets, essential to the fulfillment of the Netflix experience, on servers busily serving live customer requests at or near maximum capacity.  Although Hollow goes to extraordinary lengths to squeeze every last bit of performance out of servers’ hardware, enormous attention to detail has gone into solidifying this critical piece of our infrastructure.

Origin

Three years ago we announced Zeno, our then-current solution in this space.  Hollow replaces Zeno but is in many ways its spiritual successor.
Zeno’s concepts of producer, consumers, data states, snapshots and deltas are carried forward into Hollow

As before, the timeline for a changing dataset can be broken down into discrete data states, each of which is a complete snapshot of the data at a particular point in time.  Hollow automatically produces deltas between states; the effort required on the part of consumers to stay updated is minimized.  Hollow deduplicates data automatically to minimize the heap footprint of our datasets on consumers.

Evolution

Hollow takes these concepts and evolves them, improving on nearly every aspect of the solution.  

Hollow eschews POJOs as an in-memory representation, instead replacing them with a compact, fixed-length, strongly typed encoding of the data.  This encoding is designed both to minimize a dataset’s heap footprint and to minimize the CPU cost of accessing data on the fly.  All encoded records are packed into reusable slabs of memory which are pooled on the JVM heap to avoid impacting GC behavior on busy servers.
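
As a rough illustration of what a fixed-length encoding buys you, here is a conceptual Java sketch, not Hollow's internals, in which the field layout and widths are arbitrary assumptions. Records are packed into a single primitive array instead of allocating one object per record, so the whole dataset is a handful of large arrays rather than millions of small objects.

class PackedMovies {
    // Layout per record: [movieId: long][releaseYear: int][runtimeMinutes: int]
    private static final int LONGS_PER_RECORD = 2;

    private final long[] slab;   // one reusable slab instead of per-record objects

    PackedMovies(int maxRecords) {
        this.slab = new long[maxRecords * LONGS_PER_RECORD];
    }

    void put(int ordinal, long movieId, int releaseYear, int runtimeMinutes) {
        int base = ordinal * LONGS_PER_RECORD;
        slab[base] = movieId;
        // Pack the two ints into the high and low halves of one long.
        slab[base + 1] = ((long) releaseYear << 32) | (runtimeMinutes & 0xFFFFFFFFL);
    }

    long movieId(int ordinal)       { return slab[ordinal * LONGS_PER_RECORD]; }
    int releaseYear(int ordinal)    { return (int) (slab[ordinal * LONGS_PER_RECORD + 1] >>> 32); }
    int runtimeMinutes(int ordinal) { return (int) slab[ordinal * LONGS_PER_RECORD + 1]; }

    public static void main(String[] args) {
        PackedMovies movies = new PackedMovies(10);
        movies.put(0, 1L, 2016, 49);
        System.out.println(movies.movieId(0) + " / " + movies.releaseYear(0)
                + " / " + movies.runtimeMinutes(0));
    }
}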

An example of how OBJECT type records are laid out in memory

Hollow datasets are self-contained; no use-case-specific code needs to accompany a serialized blob in order for it to be usable by the framework.  Additionally, Hollow is designed with backwards compatibility in mind, so deployments can happen less frequently.

“Allowing for the construction of
powerful access patterns, whether
or not they were originally anticipated
while designing the data model.”

Because Hollow is all in-memory, tooling can be implemented with the assumption that random access over the entire breadth of the dataset can be accomplished without ever leaving the JVM heap.  A multitude of prefabricated tools ship with Hollow, and creation of your own tools using the basic building blocks provided by the library is straightforward.

Core to Hollow’s usage is the concept of indexing the data in various ways.  This enables O(1) access to relevant records in the data, allowing for the construction of powerful access patterns, whether or not they were originally anticipated while designing the data model.
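
As a conceptual sketch of that indexing idea (again, not Hollow's actual index classes), one could build a simple hash index over the packed records from the earlier sketch, giving O(1) lookups from a key field to a record ordinal:

import java.util.HashMap;
import java.util.Map;

class MovieIdIndex {
    private final Map<Long, Integer> ordinalByMovieId = new HashMap<>();

    // One pass over the in-memory records builds the index.
    MovieIdIndex(PackedMovies movies, int recordCount) {
        for (int ordinal = 0; ordinal < recordCount; ordinal++) {
            ordinalByMovieId.put(movies.movieId(ordinal), ordinal);
        }
    }

    // O(1) lookup from a movie id to its record ordinal, or -1 if absent.
    int ordinalOf(long movieId) {
        return ordinalByMovieId.getOrDefault(movieId, -1);
    }
}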

Benefits

Tooling for Hollow is easy to set up and intuitive to understand.  You’ll be able to gain insights into things in your data you didn’t know you were unaware of.

The history tool allows for inspecting the changes in specific records over time

Hollow can make you operationally powerful.  If something looks wrong about a specific record, you can pinpoint exactly what changed and when it happened with a simple query into the history tool.  If disaster strikes and you accidentally publish a bad dataset, you can roll back your dataset to just before the error occurred, stopping production issues in their tracks.  Because transitioning between states is fast, this action can take effect across your entire fleet within seconds.

“Once your data is Hollow, it has more potential.”

Hollow has been enormously beneficial at Netflix; we've seen server startup times and heap footprints decrease across the board in the face of ever-increasing metadata needs.  Due to targeted data modeling efforts identified through detailed heap footprint analysis made possible by Hollow, we will be able to continue these performance improvements.

In addition to performance wins, we've seen huge productivity gains related to the dissemination of our catalog data.  This is due in part to the tooling that Hollow provides, and in part due to architectural choices which would not have been possible without it.

Conclusion

Everywhere we look, we see a problem that can be solved with Hollow.  Today, Hollow is available for the whole world to take advantage of.  

Hollow isn’t appropriate for datasets of all sizes.  If the data is large enough, keeping the entire dataset in memory isn’t feasible.  However, with the right framework, and a little bit of data modeling, that threshold is likely much higher than you think.

Documentation is available at http://hollow.how, and the code is available on GitHub.  We recommend diving into the quick start guide; you’ll have a demo up and running in minutes, and a fully production-scalable implementation of Hollow at your fingertips in about an hour.  From there, you can plug in your data model and it’s off to the races.

Once you get started, you can get help from us directly or from other users via Gitter, or by posting to Stack Overflow with the tag “hollow”.

By Drew Koszewnik

Netflix Conductor: A microservices orchestrator

The Netflix Content Platform Engineering team runs a number of business processes which are driven by asynchronous orchestration of tasks executing on microservices.  Some of these are long running processes spanning several days. These processes play a critical role in getting titles ready for streaming to our viewers across the globe.

A few examples of these processes are:

  • Studio partner integration for content ingestion
  • IMF based content ingestion from our partners
  • Process of setting up new titles within Netflix
  • Content ingestion, encoding, and deployment to CDN

Traditionally, some of these processes had been orchestrated in an ad-hoc manner using a combination of pub/sub, direct REST calls, and a database to manage the state.  However, as the number of microservices grows and the complexity of the processes increases, getting visibility into these distributed workflows becomes difficult without a central orchestrator.

We built Conductor “as an orchestration engine” to address the following requirements, take out the need for boilerplate in apps, and provide a reactive flow:

  • Blueprint based. A JSON DSL based blueprint defines the execution flow.
  • Tracking and management of workflows.
  • Ability to pause, resume and restart processes.
  • User interface to visualize process flows.
  • Ability to synchronously process all the tasks when needed.
  • Ability to scale to millions of concurrently running process flows.
  • Backed by a queuing service abstracted from the clients.
  • Be able to operate over HTTP or other transports e.g. gRPC.

Conductor was built to serve the above needs and has been in use at Netflix for almost a year now. To date, it has helped orchestrate more than 2.6 million process flows ranging from simple linear workflows to very complex dynamic workflows that run over multiple days.

Today, we are open sourcing Conductor to the wider community hoping to learn from others with similar needs and enhance its capabilities.  You can find the developer documentation for Conductor here.

Why not peer to peer choreography?

With peer to peer task choreography, we found it was harder to scale with growing business needs and complexities.  The pub/sub model worked for the simplest of flows, but quickly highlighted some of the issues associated with the approach:
  • Process flows are “embedded” within the code of multiple applications
  • Often, there is tight coupling and assumptions around input/output, SLAs etc, making it harder to adapt to changing needs
  • Almost no way to systematically answer “What is remaining for a movie's setup to be complete?”

Why Microservices?

In a microservices world, a lot of business process automation is driven by orchestrating across services. Conductor enables orchestration across services while providing control and visibility into their interactions. Having the ability to orchestrate across microservices also helped us leverage existing services to build new flows, or update existing flows to use Conductor very quickly, effectively providing an easier route to adoption.

Architectural Overview


At the heart of the engine is a state machine service aka Decider service. As the workflow events occur (e.g. task completion, failure etc.), Decider combines the workflow blueprint with the current state of the workflow, identifies the next state, and schedules appropriate tasks and/or updates the status of the workflow.

Decider works with a distributed queue to manage scheduled tasks.  We have been using dyno-queues on top of Dynomite for managing distributed delayed queues. The queue recipe was open sourced earlier this year, and here is the blog post.

Task Worker Implementation

Tasks, implemented by worker applications, communicate via the API layer. Workers achieve this by either implementing a REST endpoint that can be called by the orchestration engine or by implementing a polling loop that periodically checks for pending tasks. Workers are intended to be idempotent stateless functions. The polling model allows us to handle backpressure on the workers and provide auto-scalability based on the queue depth when possible. Conductor provides APIs to inspect the workload size for each worker that can be used to autoscale worker instances.
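
The polling model can be sketched as follows. This is a hedged illustration in Java: the ConductorClient interface and its pollTask/updateTask methods are hypothetical stand-ins, not the actual Conductor client API. The point is the poll, execute, report-status loop run by an idempotent, stateless worker.

import java.util.HashMap;
import java.util.Map;

interface ConductorClient {
    WorkerTask pollTask(String taskType, String workerId);                  // null when nothing is pending
    void updateTask(String taskId, String status, Map<String, Object> output);
}

class WorkerTask {
    final String taskId;
    final Map<String, Object> input;
    WorkerTask(String taskId, Map<String, Object> input) { this.taskId = taskId; this.input = input; }
}

class EncodeWorker implements Runnable {
    private final ConductorClient client;
    EncodeWorker(ConductorClient client) { this.client = client; }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            WorkerTask task = client.pollTask("encode_task", "worker-1");   // poll for pending work
            if (task == null) continue;                                     // nothing to do; poll again
            try {
                // Idempotent, stateless work driven purely by the task's input.
                Map<String, Object> output = new HashMap<>();
                output.put("encodedUrl", task.input.get("url") + ".encoded");
                client.updateTask(task.taskId, "COMPLETED", output);        // report status back
            } catch (Exception e) {
                Map<String, Object> error = new HashMap<>();
                error.put("error", e.getMessage());
                client.updateTask(task.taskId, "FAILED", error);
            }
        }
    }
}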


Worker communication with the engine

API Layer

The APIs are exposed over HTTP - using HTTP allows for ease of integration with different clients. However, adding another protocol (e.g. gRPC) should be possible and relatively straightforward.

Storage

We use Dynomite “as a storage engine” along with Elasticsearch for indexing the execution flows. The storage APIs are pluggable and can be adapted for various storage systems, including traditional RDBMSs or NoSQL stores like Apache Cassandra.

Key Concepts

Workflow Definition

Workflows are defined using a JSON based DSL.  A workflow blueprint defines a series of tasks that need to be executed.  Each of the tasks is either a control task (e.g. fork, join, decision, sub workflow, etc.) or a worker task.  Workflow definitions are versioned, providing flexibility in managing upgrades and migration.

An outline of a workflow definition:
{
 "name": "workflow_name",
 "description": "Description of workflow",
 "version": 1,
 "tasks": [
   {
     "name": "name_of_task",
     "taskReferenceName": "ref_name_unique_within_blueprint",
     "inputParameters": {
       "movieId": "${workflow.input.movieId}",
       "url": "${workflow.input.fileLocation}"
     },
     "type": "SIMPLE",
     ... (any other task specific parameters)
   },
   {}
   ...
 ],
 "outputParameters": {
   "encoded_url": "${encode.output.location}"
 }
}

Task Definition

Each task’s behavior is controlled by its template, known as a task definition. A task definition provides control parameters for each task, such as timeouts, retry policies etc. A task can be a worker task implemented by an application, or a system task that is executed by the orchestration server.  Conductor provides out-of-the-box system tasks such as Decision, Fork, Join, Sub Workflows, and an SPI that allows plugging in custom system tasks. We have added support for HTTP tasks that facilitate making calls to REST services.

JSON snippet of a task definition:
{
 "name": "encode_task",
 "retryCount": 3,
 "timeoutSeconds": 1200,
 "inputKeys": [
   "sourceRequestId",
   "qcElementType"
 ],
 "outputKeys": [
   "state",
   "skipped",
   "result"
 ],
 "timeoutPolicy": "TIME_OUT_WF",
 "retryLogic": "FIXED",
 "retryDelaySeconds": 600,
 "responseTimeoutSeconds": 3600
}

Inputs / Outputs

Input to a task is a map whose values come either from the workflow instantiation or from the output of other tasks. This configuration allows inputs and outputs from the workflow or from upstream tasks to be routed to downstream tasks, which can then act on them. For example, the output of an encoding task can be provided to a publish task as input to deploy to the CDN.

JSON snippet for defining task inputs:
{
     "name": "name_of_task",
     "taskReferenceName": "ref_name_unique_within_blueprint",
     "inputParameters": {
       "movieId": "${workflow.input.movieId}",
       "url": "${workflow.input.fileLocation}"
     },
     "type": "SIMPLE"
   }

An Example

Let’s look at a very simple encode and deploy workflow:



There are a total of 3 worker tasks and a control task (Errors) involved:

  1. Content Inspection: Checks the file at input location for correctness/completeness
  2. Encode: Generates a video encode
  3. Publish: Publishes to CDN

These three tasks are implemented by different workers which are polling for pending tasks using the task APIs. These are ideally idempotent tasks that operate on the input given to the task, perform work, and update the status back.

As each task is completed, the Decider evaluates the state of the workflow instance against the blueprint (for the version corresponding to the workflow instance) and identifies the next set of tasks to be scheduled, or completes the workflow if all tasks are done.
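
A drastically simplified sketch of that decide step for a purely linear workflow (an assumption; real blueprints also include forks, joins, decisions and sub-workflows) looks like this:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class LinearDecider {
    // Compare the blueprint's ordered tasks with what has completed, and return
    // the next task to schedule, or empty if the workflow is done.
    static Optional<String> decide(List<String> blueprintTaskRefs, Set<String> completedTaskRefs) {
        for (String taskRef : blueprintTaskRefs) {
            if (!completedTaskRefs.contains(taskRef)) {
                return Optional.of(taskRef);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> blueprint = Arrays.asList("content_inspection", "encode", "publish");
        Set<String> done = new HashSet<>(Arrays.asList("content_inspection"));
        System.out.println(decide(blueprint, done));   // prints Optional[encode]
    }
}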

UI

The UI is the primary mechanism for monitoring and troubleshooting workflow executions. It provides much-needed visibility into the processes by allowing searches based on various parameters, including input/output parameters, and by providing a visual presentation of the blueprint and the paths it has taken, to better understand process flow execution. For each workflow instance, the UI provides details of each task execution with the following details:

  • Timestamps for when the task was scheduled, picked up by the worker and completed.
  • If the task has failed, the reason for failure.
  • Number of retry attempts.
  • Host on which the task was executed.
  • Inputs provided to the task and output from the task upon completion.

Here’s a UI snippet from a kitchen sink workflow used to generate performance numbers:




Other solutions considered

Amazon SWF

We started with an early version built on Amazon Simple Workflow (SWF). However, we chose to build Conductor given some of the limitations with SWF:
  • Need for blueprint based orchestration, as opposed to programmatic deciders as required by SWF.
  • UI for visualization of flows.
  • Need for more synchronous nature of APIs when required (rather than purely message based)
  • Need for indexing inputs and outputs for workflow and tasks and ability to search workflows based on that.
  • Need to maintain a separate data store to hold workflow events to recover from failures, search etc.

AWS Step Functions
The recently announced AWS Step Functions added some of the features we were looking for in an orchestration engine. There is a potential for Conductor to adopt the Amazon States Language to define workflows.

Some Stats

Below are some of the stats from the production instance we have been running for a little over a year now. Most of these workflows are used by content platform engineering in supporting various flows for content acquisition, ingestion and encoding.

Total instances created YTD: 2.6 million
No. of distinct workflow definitions: 100
No. of unique workers: 190
Avg. no. of tasks per workflow definition: 6
Largest workflow: 48 tasks

Future considerations

  • Support for AWS Lambda (or similar) functions as tasks, for serverless execution of simple tasks.
  • Tighter integration with container orchestration frameworks that will allow worker instance auto-scalability.
  • Logging execution data for each task. We think this is a useful addition that helps in troubleshooting.
  • Ability to create and manage the workflow blueprints from the UI.
  • Support for the Amazon States Language.

If you like the challenges of building distributed systems and are interested in building the Netflix studio ecosystem and the content pipeline at scale, check out our job openings.

By Viren Baraiya, Vikram Singh


Netflix Now Supports Ultra HD 4K on Windows 10 with New Intel Core Processors

We're excited to bring Netflix support for Ultra HD 4K to Windows 10, making the vast catalog of Netflix TV shows and movies in 4K even more accessible for our members around the world to watch in the best picture quality.

For the last several years, we've been working with our partners across the spectrum of CE devices to add support for the richer visual experience of 4K.  Since we launched 4K on Smart TVs in 2014, many different devices have added the ability to play our 4K content, including Smart TVs, set top boxes and game consoles.  We are pleased to add Windows 10 and 7th Gen Intel® Core™ CPUs to that list.

Microsoft and Intel both did great work to enable 4K on their platforms.  Intel added support for new, more efficient codecs necessary to stream 4K as well as hardware-based content security in their latest CPUs.  Microsoft enhanced the Edge browser with the latest HTML5 video support and made it work beautifully with Intel's latest processors.  The sum total is an enriched Netflix experience. Thanks to Microsoft's Universal Windows Platform, our app for Windows 10 includes the same 4K support as the Edge browser.

As always, you can enjoy all of our movies and TV shows on all supported platforms. We are working hard with our partners to further expand device support of 4K. An increasing number of our Netflix originals are shot, edited, and delivered in this format, with more than 600 hours available to watch, such as Stranger Things, The Crown, Gilmore Girls: A Year in the Life and Marvel's Luke Cage.

By Matt Trunnell, Nick Eddy, and Greg Wallace-Freedman

Crafting a high-performance TV user interface using React


The Netflix TV interface is constantly evolving as we strive to figure out the best experience for our members. For example, after A/B testing, eye-tracking research, and customer feedback we recently rolled out video previews to help members make better decisions about what to watch. We’ve written before about how our TV application consists of an SDK installed natively on the device, a JavaScript application that can be updated at any time, and a rendering layer known as Gibbon. In this post we’ll highlight some of the strategies we’ve employed along the way to optimize our JavaScript application performance.

React-Gibbon

In 2015, we embarked on a wholesale rewrite and modernization of our TV UI architecture. We decided to use React because its one-way data flow and declarative approach to UI development make it easier to reason about our app. Obviously, we’d need our own flavor of React since at that time it only targeted the DOM. We were able to create a prototype that targeted Gibbon pretty quickly. This prototype eventually evolved into React-Gibbon and we began to work on building out our new React-based UI.

React-Gibbon’s API would be very familiar to anyone who has worked with React-DOM. The primary difference is that instead of divs, spans, inputs etc, we have a single “widget” drawing primitive that supports inline styling.

React.createClass({
   render() {
       return <Widget style={{ text: 'Hello World', textSize: 20 }} />;
   }
});

Performance is a key challenge

Our app runs on hundreds of different devices, from the latest game consoles like the PS4 Pro to budget consumer electronics devices with limited memory and processing power. The low-end machines we target can often have sub-GHz single core CPUs, low memory and limited graphics acceleration. To make things even more challenging, our JavaScript environment is an older non-JIT version of JavaScriptCore. These restrictions make super responsive 60fps experiences especially tricky and drive many of the differences between React-Gibbon and React-DOM.

Measure, measure, measure

When approaching performance optimization it’s important to first identify the metrics you will use to measure the success of your efforts. We use the following metrics to gauge overall application performance:

  • Key Input Responsiveness - the time taken to render a change in response to a key press
  • Time To Interactivity - the time to start up the app
  • Frames Per Second - the consistency and smoothness of our animations
  • Memory Usage


The strategies outlined below are primarily aimed at improving key input responsiveness. They were all identified, tested and measured on our devices and are not necessarily applicable in other environments. As with all “best practice” suggestions it is important to be skeptical and verify that they work in your environment, and for your use case. We started off by using profiling tools to identify what code paths were executing and what their share of the total render time was; this led us to some interesting observations.

Observation: React.createElement has a cost

When Babel transpiles JSX, it converts it into a number of React.createElement function calls which, when evaluated, produce a description of the next Component to render. If we can predict what the createElement function will produce, we can inline the call with the expected result at build time rather than at runtime.


// JSX
render() {
   return <MyComponent key='mykey' prop1='foo' prop2='bar' />;
}

// Transpiled
render() {
   return React.createElement(MyComponent, { key: 'mykey', prop1: 'foo', prop2: 'bar' });
}

// With inlining
render() {
   return {
       type: MyComponent,
       props: {
           prop1: 'foo',
           prop2: 'bar'
       },
       key: 'mykey'
   };
}


As you can see we have removed the cost of the createElement call completely, a triumph for the “can we just not?” school of software optimization.

We wondered whether it would be possible to apply this technique across our whole application and avoid calling createElement entirely. What we found was that if we used a ref on our elements, createElement needs to be called in order to hook up the owner at runtime. This also applies if you’re using the spread operator which may contain a ref value (we’ll come back to this later).

We use a custom Babel plugin for element inlining, but there is an official plugin that you can use right now. Rather than an object literal, the official plugin will emit a call to a helper function that is likely to disappear thanks to the magic of V8 function inlining. After applying our plugin there were still quite a few components that weren’t being inlined, specifically Higher-order Components which make up a decent share of the total components being rendered in our app.

Problem: Higher-order Components can’t use Inlining

We love Higher-order Components (HOCs) as an alternative to mixins. HOCs make it easy to layer on behavior while maintaining a separation of concerns. We wanted to take advantage of inlining in our HOCs, but we ran into an issue: HOCs usually act as a pass-through for their props. This naturally leads to the use of the spread operator, which prevents the Babel plug-in from being able to inline.

When we began the process of rewriting our app, we decided that all interactions with the rendering layer would go through declarative APIs. For example, instead of doing:

componentDidMount() {
   this.refs.someWidget.focus()
}

In order to move application focus to a particular Widget, we instead implemented a declarative focus API that allows us to describe what should be focused during render like so:

render() {
   return <Widget focused={true} />;
}

This had the fortunate side-effect of allowing us to avoid the use of refs throughout the application. As a result we were able to apply inlining regardless of whether the code used a spread or not.


// before inlining
render() {
   return <MyComponent {...this.props} />;
}

// after inlining
render() {
   return {
       type: MyComponent,
       props: this.props
   };
}
This greatly reduced the number of function calls and the amount of property merging that we previously had to do, but it did not eliminate it completely.

Problem: Property interception still requires a merge

After we had managed to inline our components, our app was still spending a lot of time merging properties inside our HOCs. This was not surprising, as HOCs often intercept incoming props in order to add their own or change the value of a particular prop before forwarding on to the wrapped component.

We analyzed how stacks of HOCs scaled with prop count and component depth on one of our devices, and the results were informative.




They showed that there is a roughly linear relationship between the number of props moving through the stack and the render time for a given component depth.

Death by a thousand props

Based on our findings we realized that we could improve the performance of our app substantially by limiting the number of props passed through the stack. We found that groups of props were often related and always changed at the same time. In these cases, it made sense to group those related props under a single “namespace” prop. If a namespace prop can be modeled as an immutable value, subsequent calls to shouldComponentUpdate can be optimized further by checking referential equality rather than doing a deep comparison. This gave us some good wins but eventually we found that we had reduced the prop count as much as was feasible. It was now time to resort to more extreme measures.

Merging props without key iteration

Warning, here be dragons! This is not recommended and most likely will break many things in weird and unexpected ways.

After reducing the props moving through our app we were experimenting with other ways to reduce the time spent merging props between HOCs. We realized that we could use the prototype chain to achieve the same goals while avoiding key iteration.


// before proto merge
render() {
   const newProps = Object.assign({}, this.props, { prop1: 'foo' })
   return <MyComponent {...newProps} />;
}

// after proto merge
render() {
   const newProps = { prop1: 'foo' };
   newProps.__proto__ = this.props;
   return {
       type: MyComponent,
       props: newProps
   };
}

In the example above we reduced the 100-depth, 100-prop case from a render time of ~500ms to ~60ms. Be advised that using this approach introduced some interesting bugs, namely in the event that this.props is a frozen object. When this happens, the prototype chain approach only works if the __proto__ is assigned after the newProps object is created. Needless to say, if you are not the owner of newProps it would not be wise to assign the prototype at all.

Problem: “Diffing” styles was slow

Once React knows the elements it needs to render it must then diff them with the previous values in order to determine the minimal changes that must be applied to the actual DOM elements. Through profiling we found that this process was costly, especially during mount - partly due to the need to iterate over a large number of style properties.

Separate out style props based on what’s likely to change

We found that often many of the style values we were setting were never actually changed. For example, say we have a Widget used to display some dynamic text value. It has the properties text, textSize, textWeight and textColor. The text property will change during the lifetime of this Widget but we want the remaining properties to stay the same. The cost of diffing the 4 widget style props is spent on each and every render. We can reduce this by separating out the things that could change from the things that don't.

const memoizedStylesObject = { textSize: 20, textWeight: 'bold', textColor: 'blue' };


<Widget staticStyle={memoizedStylesObject} style={{ text: this.props.text }} />

If we are careful to memoize the memoizedStylesObject object, React-Gibbon can then check for referential equality and only diff its values if that check proves false. This has no effect on the time it takes to mount the widget but pays off on every subsequent render.

Why not avoid the iteration all together?

Taking this idea further, if we know what style props are being set on a particular widget, we can write a function that does the same work without having to iterate over any keys. We wrote a custom Babel plugin that performed static analysis on component render methods. It determines which styles are going to be applied and builds a custom diff-and-apply function which is then attached to the widget props.


// This function is written by the static analysis plugin
function __update__(widget, nextProps, prevProps) {
   var style = nextProps.style,
       prev_style = prevProps && prevProps.style;


   if (prev_style) {
       var text = style.text;
       if (text !== prev_style.text) {
           widget.text = text;
       }
   } else {
       widget.text = style.text;
   }
}


React.createClass({
   render() {
       return (
           <Widget __update__={__update__} style={{ text: this.props.title }}  />
       );
   }
});


Internally React-Gibbon looks for the presence of the “special” __update__ prop and will skip the usual iteration over previous and next style props, instead applying the properties directly to the widget if they have changed. This had a huge impact on our render times at the cost of increasing the size of the distributable.

Performance is a feature

Our environment is unique, but the techniques we used to identify opportunities for performance improvements are not. We measured, tested and verified all of our changes on real devices. Those investigations led us to discover a common theme: key iteration was expensive. As a result we set out to identify the places where props were merged in our application, and determine whether those merges could be optimized. Here’s a list of some of the other things we’ve done in our quest to improve performance:

  • Custom Composite Component - hyper optimized for our platform
  • Pre-mounting screens to improve perceived transition time
  • Component pooling in Lists
  • Memoization of expensive computations

Building a Netflix TV UI experience that can run on the variety of devices we support is a fun challenge. We nurture a performance-oriented culture on the team and are constantly trying to improve the experiences for everyone, whether they use the Xbox One S, a smart TV or a streaming stick. Come join us if that sounds like your jam!


Coming Soon: Netflix Android Bootcamp

Mobile devices have had an incredible impact on people’s lives in the past few years and, as expected, Netflix members have embraced these devices for significant amounts of online video viewing. Just over a year ago, Netflix expanded from streaming in 60 countries to over 190 countries today. In many of our newer countries, mobile usage is outsized compared with other platforms. As Netflix has grown globally, we have a corresponding desire to invest more time and energy in creating great experiences on mobile.


Demand for Android engineers in the Bay Area has only increased over time and great mobile engineers are rare. To help develop great mobile talent, we’re excited to announce that we are sponsoring an upcoming Android bootcamp by CodePath. CodePath runs intensive training programs for practicing engineers who wish to develop knowledge and skill on a new platform (Android in this case).


We have chosen to partner with them for the following reasons:
  1. There are a number of CodePath graduates at Netflix who have demonstrated a thorough understanding of the new platform upon class completion. Each has testified to CodePath’s success as a challenging program that rapidly prepared them for mobile development.
  2. CodePath’s program requires a rigorous demonstration of aptitude and dedication. Both Netflix and CodePath seek excellence.
  3. The CodePath admissions process is blind to gender and race, creating a level playing field for people who are often underrepresented in tech. We share the goal of creating a diverse and inclusive environment for individuals from a variety of backgrounds.


There is no cost for participants to attend. For practicing software engineers, the primary requirement is dedicating your time and energy to learn a new platform. We aim to accept 15 engineers into the program on-site to keep the class size small. Additional seats will be available to remote students.


Sessions will be held at our headquarters in Los Gatos. Classes start March 6, 2017, continuing each Monday and Wednesday evening for 8 weeks. We look forward to having you there to develop your skills on this fun and exciting platform.






By Greg Benson, Android Innovation team

Netflix Hack Day - Winter 2017



We hosted another great Hack Day event a week ago at Netflix headquarters.  Hack Day is a way for our product development team to take a break from everyday work, have fun, experiment with new technologies, and collaborate with new people.


Each Hack Day, we see a wide range of ideas on how to improve the product and internal tools as well as some that are just meant to be fun.  This event was no different.  We’ve embedded videos below, produced by the hackers, of some of our favorites. You can also see more hacks from several of our past events: May 2016, November 2015, March 2015, Feb. 2014 & Aug. 2014.


While we’re excited about the creativity and thought put into these hacks, they may never become part of the Netflix product, internal infrastructure, or otherwise be used beyond Hack Day.  We are posting them here publicly to share the spirit of the event and our culture of innovation.


Thanks again to the hackers who, in just 24 hours, assembled really innovative hacks.  




Stranger Bling

MDAS (Mobile Demogorgon Alerting System) a.k.a Ugly Christmas Sweater
LEDs soldered to a Stranger Things sweater, controlled wirelessly with an Arduino, spelling out messages from the Upside Down!




MindFlix

Navigate and control Netflix with your mind (with help from a Muse headband).




Netflix for Good

After watching socially conscious titles on Netflix, Netflix for Good allows users to donate to related and well known organizations from inside the Netflix app.




Stranger Games

Stranger Things re-imagined as a home console video game collection created back in 1983, but with a twist.




Picture in Picture

See what other profiles on your account are watching via picture in picture.


Introducing HubCommander

By Mike Grima, Andrew Spyker, and Jason Chan

Netflix is pleased to announce the open source release of HubCommander, a ChatOps tool for GitHub management.

Why HubCommander?

Netflix uses GitHub, a source code management and collaboration site, extensively for both open source and internal projects. The security model for GitHub does not permit users to perform repository management without granting administrative permissions. Management of many users on GitHub can be a challenge without tooling. We needed to provide enhanced security capabilities while maintaining developer agility. As such, we created HubCommander to provide these capabilities in a method optimized for Netflix.

Why ChatOps?

Our approach leverages ChatOps, which utilizes chat applications for performing operational tasks. ChatOps is increasingly popular amongst developers, since chat tools are ubiquitous, provide a single context for what actions occurred when and by whom, and also provide an effective means of self-service for developers.

How Netflix leverages GitHub:

All Netflix-owned GitHub repositories reside within multiple GitHub organizations. Organizations contain the git repositories and the users that maintain them. Users can be added to teams, and teams are given access to individual repositories. In this model, a GitHub user gets invited to an organization by an administrator. Once invited, the user becomes a member of the organization, and is placed into one or more teams.

At Netflix, we have several organizations that serve specific purposes, including our primary OSS organization “Netflix”, our “Spinnaker” organization dedicated to our OSS continuous delivery platform, and a skunkworks organization, “Netflix-Skunkworks”, for projects in rough development that may or may not become fully fledged OSS projects.

Challenges we face:

One of the biggest challenges with using GitHub organizations is user management. GitHub organizations are individual entities that must be separately administered. As such, the complexity of user management increases with the number of organizations. To reduce complexity, we enforce a consistent permissions model across all of our organizations. This allows us to develop tools to simplify and streamline our GitHub organization administration.

How we apply security to our GitHub organizations:

The permissions model that we follow is one that applies the principle of least privilege, but is still open enough that developers can obtain the access they need and move fast. The general structure we utilize is to place all employees in an “employees” team that has “push” (write) access to all repositories. We similarly have teams for “bot” accounts to provide for automation. Lastly, we have very few users with the “owner” role, as owners are full administrators that can make changes to the organization itself.

While we permit our developers to have write access to all of our repositories, we do not directly permit them to create, delete, or change repository visibility. Additionally, all developers are required to have multi-factor authentication enabled. All of our developers on GitHub have their IDs linked in our internal employee tracking system, and GitHub membership in our organizations is automatically removed when employees leave the company (we have scripts to automate this).

We also enable third-party application restrictions on our organizations to only allow specific third party GitHub applications access to our repositories.

Why is tooling required?

We want self-service tooling that is as usable as giving users administrative access, but without the risk of actually making all users administrators.

Our tooling provides a consistent permissions model across all of our GitHub organizations. It also empowers our users to perform privileged operations on GitHub in a consistent and supported manner, while limiting their individual GitHub account permissions.

Because we limit individual GitHub account permissions, developers can hit friction when creating repositories, since they also want to update the description and homepage, and even set default branches. Many of our developers also utilize Travis CI for automated builds. Travis CI enablement requires that users be administrators of their repositories, which we do not permit. Our developers also collaborate on projects with teams outside of Netflix, but they do not have permissions to invite users to our organizations or to add outside collaborators to repositories. This is where HubCommander comes in.

The HubCommander Bot

HubCommander is a Slack bot for GitHub organizational management. It provides a ChatOps means for administering GitHub organizations. HubCommander operates by utilizing a privileged account on GitHub to perform administrative capabilities on behalf of our users. Our developers issue commands to the bot to perform their desired actions. This has a number of advantages:
  1. Self-Service: By providing a self-service mechanism, we have significantly reduced our administrative burden for managing our GitHub repositories. The reduction in administrative overhead has significantly simplified our open source efforts.
  2. Consistent and Supported: The bot performs all of the tasks that are required for operating on GitHub. For example, when creating repositories, the bot will automatically provide the correct teams access to the new repository.
  3. Least Privilege for Users: Because the bot can perform the tasks that users need to perform, we can reduce the GitHub API permissions on our users.
  4. Developer Familiarity: ChatOps is very popular at Netflix, so utilizing a bot for this purpose is natural for our developers.
  5. Easy to Use: The bot has an easily discoverable command structure.
  6. Secure: The bot also features integration with Duo for additional authentication.

HubCommander Features:

Out of the box, HubCommander has the following features:
  • Repository creation
  • Repository description and website modification
  • Granting outside collaborators specific permissions to repositories
  • Repository default branch modification
  • Travis CI enablement
  • Duo support to provide authentication to privileged commands
  • Docker image support
HubCommander is also extendable and configurable. You can develop authentication and command based plugins. At Netflix, we have developed a command plugin which allows our developers to invite themselves to any one of our organizations. When they perform this process, their GitHub ID is automatically linked in our internal employee tracking system. With this linkage, we can automatically remove their GitHub organization membership when they leave the company.
Duo is also supported to add additional safeguards for privileged commands. This has the added benefit of protecting against accidental command issuance, as well as protecting in the event that Slack credentials are compromised. With the Duo plugin, issuing a command will also trigger a "Duo push" to the employee’s device. The command only continues to execute if the request is approved. If your company doesn’t use Duo, you can develop your own authentication plugin to integrate with any internal or external authentication system to safeguard commands.
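
To give a sense of the shape such a command plugin might take, here is a minimal sketch in Python. Everything in it (the class name, the invite_member call, the slack_user object) is hypothetical and purely illustrative; it is not HubCommander’s actual plugin interface, which is documented in the project’s repository.

# Hypothetical sketch of a ChatOps command plugin; the class and method
# names here are illustrative and are NOT HubCommander's actual plugin API.
class InviteMeCommand:
    """Lets an employee invite themselves to a GitHub organization."""

    name = "!InviteMe"

    def __init__(self, github_client, allowed_orgs):
        self.github = github_client          # pre-authenticated GitHub API client
        self.allowed_orgs = allowed_orgs     # organizations this command may touch

    def execute(self, slack_user, args):
        # args[0] is expected to be the organization name, e.g. "Netflix-Skunkworks"
        org = args[0]
        if org not in self.allowed_orgs:
            return f"Sorry, {org} is not a managed organization."
        # The privileged bot account performs the invite on the user's behalf,
        # so the user's own GitHub account never needs elevated permissions.
        self.github.invite_member(org=org, username=slack_user.github_id)
        return f"Invited {slack_user.github_id} to {org}."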
Using the Bot:
Using the bot is as easy as typing !help in the Slack channel. This will provide a list of commands that HubCommander supports:
To learn how to issue a specific command, simply issue that command without any arguments. HubCommander will output the syntax for the command. For example, to create a new repository, you would issue the !CreateRepo command:
If you are safeguarding commands with Duo (or your own authentication plugin), an example of that flow would look like this:
Contributions:
These features are only a starting point, and we plan on adding more soon. If you’d like to extend these features, we’d love contributions to our repository on GitHub.


Introducing Netflix Stethoscope


Netflix is pleased to announce the open source release of Stethoscope, our first project following a User Focused Security approach.

The notion of “User Focused Security” acknowledges that attacks against corporate users (e.g., phishing, malware) are the primary mechanism leading to security incidents and data breaches, and it’s one of the core principles driving our approach to corporate information security. It’s also reflective of our philosophy that tools are only effective when they consider the true context of people’s work.

Stethoscope is a web application that collects information for a given user’s devices and gives them clear and specific recommendations for securing their systems.

If we provide employees with focused, actionable information and low-friction tools, we believe they can get their devices into a more secure state without heavy-handed policy enforcement.


Software that treats people like people, not like cogs in the machine

We believe that Netflix employees fundamentally want to do the right thing, and, as a company, we give people the freedom to do their work as they see fit. As we say in the Netflix Culture Deck, responsible people thrive on freedom, and are worthy of freedom. This isn’t just a nice thing to say: we believe people are most productive and effective when they aren’t hemmed in by excessive rules and process.

That freedom must be respected by the systems, tools, and procedures we design, as well.

By providing personalized, actionable information–and not relying on automatic enforcement–Stethoscope respects people’s time, attention, and autonomy, while improving our company’s security outcomes.

If you have similar values in your organization, we encourage you to give Stethoscope a try.

Education, not automatic enforcement

It’s important to us that people understand what simple steps they can take to improve the security state of their devices, because personal devices–which we don’t control–may very well be the first target of attack for phishing, malware, and other exploits. If they fall for a phishing attack on their personal laptop, that may be the first step in an attack on our systems here at Netflix.

We also want people to be comfortable making these changes themselves, on their own time, without having to go to the help desk.

To make this self service, and so people can understand the reasoning behind our suggestions, we show additional information about each suggestion, as well as a link to detailed instructions.

Security practices

We currently track the following device configurations, which we call “practices”:

  • Disk encryption
  • Firewall
  • Automatic updates
  • Up-to-date OS/software
  • Screen lock
  • Not jailbroken/rooted
  • Security software stack (e.g., Carbon Black)

Each practice is given a rating that determines how important it is. The more important practices will sort to the top, with critical practices highlighted in red and collected in a top banner.

Implementation and data sources

Stethoscope is powered by a Python backend and a React front end. The web application doesn’t have its own data store, but directly queries various data sources for device information, then merges that data for display.


The various data sources are implemented as plugins, so it should be relatively straightforward to add new inputs. We currently support LANDESK (for Windows), JAMF (for Macs), and Google MDM (for mobile devices).
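
As a rough illustration of the plugin idea, here is a minimal sketch in Python. The class and method names are hypothetical and are not Stethoscope’s actual plugin API; the point is simply that each source normalizes its device records into a common shape that can be merged for display.

# Hypothetical sketch of a device data source; the base class and method
# names are illustrative, not Stethoscope's actual plugin interface.
class DeviceDataSource:
    """Base shape for a plugin that returns device records for a user."""

    def get_devices_by_email(self, email):
        raise NotImplementedError


class ExampleMDMSource(DeviceDataSource):
    def __init__(self, api_client):
        self.api = api_client  # client for whatever inventory system you query

    def get_devices_by_email(self, email):
        # Normalize whatever the upstream system returns into a common shape
        # so records from different sources can be merged for display.
        return [
            {
                "serial": record["serial_number"],
                "platform": record["os"],
                "practices": {
                    "encryption": record.get("disk_encrypted", False),
                    "screen_lock": record.get("screen_lock_enabled", False),
                },
            }
            for record in self.api.list_devices(owner=email)
        ]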

Notifications

In addition to device status, Stethoscope provides an interface for viewing and responding to notifications.

For instance, if you have a system that tracks suspicious application accesses, you could choose to present a notification like this:


We recommend that you only use these alerts when there is an action for somebody to take–alerts without corresponding actions are often confusing and counterproductive.

Mobile friendly

The Stethoscope user interface is responsive, so it’s easy to use on mobile devices. This is especially important for notifications, which should be easy for people to address even if they aren’t at their desk.

What’s next?

We’re excited to work with other organizations to extend the data sources that can feed into Stethoscope. Osquery is next on our list, and there are many more possible integrations.

Getting started

Stethoscope is available now on GitHub. If you’d like to get a feel for it, you can run the front end with sample data with a single command. We also have a Docker Compose configuration for running the full application.

Join us!

We hope that other organizations find Stethoscope to be a useful tool, and we welcome contributions, especially new plugins for device data.

Our team, Information Security, is also hiring a Senior UI Engineer at our Los Gatos office. If you’d like to help us work on Stethoscope and related tools, please apply!

Presentations

We’d like to thank ShmooCon for giving us the chance to present this work earlier this year. The slides and video are now both publicly available.

by Jesse Kriss and Andrew White

Netflix Downloads on Android

By Greg Benson, Francois Goldfain, and Ashish Gupta

Netflix is now a global company, so we wanted to provide a viewing experience that was truly available everywhere even when the Internet is not working well. This led to these three prioritized download use cases:
  1. Better, uninterrupted videos on unreliable Internet
  2. Reducing mobile data usage
  3. Watching Netflix without an Internet connection (e.g. on a train or plane)

... So, What Do We Build?


From a product perspective, we had many initial questions about how the feature should behave: What bitrate & resolution should we download content at? How much configuration should we offer to users? How will video bookmarks work when offline? How do we handle profiles?

We adopted some guiding principles based on general Netflix philosophies about what kind of products we want to create: the Downloads interface should not be so prominent that it's distracting, and the UX should be as simple as possible.

We chose an aggressive timeline for the feature since we wanted to deliver the experience to our members as soon as possible. We aimed to create a great experience with just the right amount of scope, and we could iterate and run A/B tests to improve the feature later on. Fortunately, our Consumer Insights team also had enough time to qualify our initial user-experience ideas with members and non-members before they were built.

How Do Downloads Work?


From an organizational perspective, the downloads feature was a test of coordination between a wide variety of teams. A technical spec was created that represented a balancing act of meeting license requirements, member desires, and security requirements (protecting from fraud). For Android, we used the technical spec to define which pieces of data we'd need to transfer to the client in order to provide a single 'downloaded video':
  • Content manifest (URLs for audio and video files)
  • Media files:
    • Primary video track
    • 2 audio tracks (one primary language plus an alternate based on user language preferences)
    • 2 subtitle tracks (based on user language preferences)
    • Trick play data (images while scrubbing)
  • DRM licenses
  • Title-level metadata and artwork (cached to disk)

Download Mechanics


We initially looked at Android's DownloadManager as the mechanism to actually transfer files and data to the client. This component was easy-to-use and handled some of the functionality we wanted. However, it didn’t ultimately allow us to create the UX we needed.

We created the Netflix DownloadManager for the following reasons:
  • Download Notifications: display download progress in a notification as an aggregate of all the files related to one 'downloadable video'.
  • Pause/Resume Downloads: provide a way for users to temporarily halt downloading.
  • Network Handling: dynamic network selection criteria in case the user changes this preference during a download (WiFi-only vs. any connection).
  • Analytics: understanding the details of all user behavior and the reasons why a download was halted.
  • Change of URL (CDN switching): Our download manifest provides multiple CDNs for the same media content. In case of failures to one CDN we wanted the ability to failover to alternate sources.

Storing Metadata


To store metadata for downloaded titles, our first implementation was a simple solution of serializing and deserializing json blobs to files on disk. We knew there would be problems with this (many objects created, GC churn, not developer-friendly), so while it wasn't our desired long-term solution, it met our needs to get a prototype off the ground.

For our second iteration of managing stored data, we looked at a few possible solutions, including built-in SQLite support. We’d also heard a lot about Realm lately, and about a few companies that had success using it as a fast and simple data-storage solution. Because we had limited experience with Realm and the downloads metadata case was relatively small and straightforward, we thought it would be a great opportunity to try Realm out.

Realm turned out to be easy to use and has a few benefits we like:
  • Zero-copy IO (by using memory mapping)
  • Strong performance profile
  • It's transactional and has crash safety (via MVCC)
  • Objects are easy to implement
  • Easy to query, no SQL statements
Realm also provides straightforward support for versioning of data, which allows data to be migrated from schema to schema if changed as part of an application update. In this case, a RealmMigration can be created which allows for mapping of data.

The challenges that most impacted our implementation were single-threaded access for objects and a lack of support for vectors such as List<>.

Now that the stability of Realm has been demonstrated in the field with downloads metadata, we are moving forward with adopting it more broadly in the app for generalized video metadata storage.


Updating Metadata


JobScheduler was introduced in Lollipop and allows us to be more resource-efficient in our background processing and network requests. The OS can batch jobs together for an overall efficiency gain. Longer term, we also wanted to build up our experience with this system component, since Google will encourage developers more and more strongly to use it in future releases (e.g., Android ‘O’).

For our download use cases, it provided a great opportunity to get low-cost (or effectively free) network usage by creating jobs that would only activate when the user was on an unmetered network. What can our app do in the background?


1. Maintenance jobs:
  • Content license renewals
  • Metadata updates
  • Sync playback metrics: operations, state data, and usage
2. Resume downloads when connectivity restored

There were two major issues we found with JobScheduler. The first was how to provide the same capabilities on pre-Lollipop devices, where JobScheduler is not available. For these devices, we wrote an abstraction layer over the job-scheduling component that uses the system's Network Connectivity receiver and AlarmManager service to schedule background tasks manually at set times.

The second major problem we encountered with JobScheduler was that it crashes in certain circumstances (public bug report filed here). While we weren't able to put in a direct fix for this crash, we were able to determine a workaround whereby we avoid calling JobService.onJobFinished() altogether in certain cases. The job ultimately times out on its own, so the cost of operating like this seemed better than permitting the app to crash.


Playback of Content


There are a number of methods of playing video on Android, varying in their complexity and level of control:

Method | Comments/Limitations | Netflix Usage
MediaPlayer | High-level API with hardware decoding, but limited format support (e.g., no DASH) and no easy way to apply DRM. | Never used
In-app solution | Bundle everything in-app, including media playback and DRM, with no use of Android system components. Everything runs on the CPU: more battery drain and potentially lower-quality playback. | Used for early versions of SD playback
OpenMAX AL | Introduced in ICS; low-level platform APIs that opened many doors. However, native C interfaces, and deprecated quickly by Google. SD only, since all DRM was done in-app. | Used for ICS/JB
MediaCodec | Low-level but in Java; playback can be built on top of system components. The first version only supported in-app DRM, which was complex and SD only. In Android 4.3, Google introduced a modular framework that let us use built-in/platform Widevine DRM support, providing Widevine L1, which allows HD playback with a hardware dependency. | Used for 4.2 and above; HD for some 4.3+ devices

Further, playback of offline (non-streaming) content is not supported by the Android system DASH player. It wasn't the only option, but we felt that downloads were a good opportunity to try Google’s new Android ExoPlayer. The features we liked were:
  • Support for DASH, HLS, Smooth Streaming, and local sources
  • Extremely modular design, extensible and customizable
  • Used by Google/OEMs/SoC vendors as part of Android certification (goes to device support and fragmentation)
  • Great documentation and tutorials
The modularity of ExoPlayer was attractive for us since it allowed us to plug in a variety of DRM solutions. Our previous in-app DRM solution did not support offline licenses so we also needed to provide support for an alternate DRM mechanism.

Supporting Widevine


Widevine was selected due to its broad Android support, ability to work with offline licenses, a hardware-based decryption module with a software fallback (suitable for nearly any mobile device), and validation required by Android's Compatibility Test Suite (CTS).

However, this was a difficult migration due to Android fragmentation. Some devices that should have had L3 didn’t, some devices had insecure implementations, and other devices had Widevine APIs that failed whenever we called them. Support was therefore inconsistent, so we had to have reporting in place to monitor these failure rates.

If we detect this kind of failure during app init then we have little choice but to disable the Downloads feature on that device since playback would not be possible. This is unfortunate for users but will hopefully improve over time as the operating system is updated on devices.

Complete block diagram of playback components used for downloads.





Improving Video Quality


Our encoding team has written previously about the specific work they did to enable high-quality, low-bandwidth mobile encodes using VP9 for Android. However, how did we decide to use VP9 in the first place?

Most mobile video streams for Netflix use H.264/AVC with the Main Profile (AVCMain). Downloads were a good opportunity for us to migrate to a new video codec to reduce downloaded content size and pave the way for improved streaming bitrates in the future. The advantages of VP9 encoding for us included:
  • Encodes produced using libvpx are ~32% more efficient than our x264 encodes.
  • Decoder required since Android KitKat, i.e. 100% coverage for current Netflix app deployment
  • Fragmented but growing hardware support: 33% of phones and 4% of tablets using Netflix have a chipset that supports VP9 decoding in hardware
Migrating to support a new video encode had some up-front and ongoing costs, not the least of which was an increased burden placed on our content-delivery system, specifically our Open-Connect Appliances (OCAs). Due to the new encoding formats, more versions of the video streams needed to be deployed and cached in our CDN which required more space on the boxes. This cost was worthwhile for us to provide improved efficiency for downloaded content in the near term, and in the long term will also benefit members streaming on mobile as we migrate to VP9 more broadly.


Results


Many teams at Netflix were aligned to work together and release this feature under an ambitious timeline. We were pleased to bring lots of joy to our members around the world and give them the ability to take their favorite shows with them on the go.

The biggest proportion of downloading has been in Asia where we see strong traction in countries like India, Thailand, Singapore, Malaysia, Philippines, and Hong Kong.

The main suggestion we received for Android was around lack of SD card support, which we quickly addressed in a subsequent release in early 2017. We have now established a baseline experience for downloads, and will be able to A/B test a number of improvements and feature enhancements in coming months.



Netflix Security Monkey on Google Cloud Platform

Today we are happy to announce that Netflix Security Monkey has BETA support for tracking Google Cloud Platform (GCP) services. Initially we are providing support for the following GCP services:


  • Firewall Rules
  • Networking
  • Google Cloud Storage Buckets (GCS)
  • Service Accounts (IAM)


This work was performed by a few incredible Googlers with the mission to take open source projects and add support for Google’s cloud offerings. Thank you for the commits!


GCP support is available in the develop branch and will be included in release 0.9.0. This work helps to fulfill Security Monkey’s mission as the single place to go to monitor your entire deployment.


To get started with Security Monkey on GCP, check out the documentation.


See Rae Wang, Product Manager on GCP, highlight Security Monkey in her talk “Gaining full control over your organization's cloud resources” (presented at Google Cloud Next '17).

Security Monkey’s History

We released Security Monkey in June 2014 as an open source tool to monitor Amazon Web Services (AWS) changes and alert on potential security problems. In 2014 it was monitoring 11 AWS services and shipped with about two dozen security checks. Now the tool monitors 45 AWS services, 4 GCP services, and ships with about 130 security checks.

Future Plans for Security Monkey

We plan to continue decomposing Security Monkey into smaller, more maintainable, and reusable modules. We also plan to use new event-driven triggers so that Security Monkey will recognize updates more quickly. With Custom Alerters, Security Monkey will transform from purely a monitoring tool to one that allows for active response.


More Modular:
  • We have begun the process of moving the service watchers out of Security Monkey and into CloudAux. CloudAux currently supports the four GCP services and three (of the 45) AWS services.
  • We have plans to move the security checks (auditors) out of Security Monkey and into a separate library.
  • Admins may change polling intervals, enable/disable technologies, and modify issue scores from within the settings panel of the web UI.
Event Driven:
  • On AWS, CloudTrail will trigger CloudWatch Event Rules, which will then trigger Lambda functions. We have a working prototype of this flow.
  • On GCP, Stackdriver Logging and Audit Logs will trigger Cloud Functions.
  • As a note, CloudSploit has a product in beta that implements this event driven approach.
Custom Alerters:
  • These can be used to provide new notification methods or correct problems.
  • The documentation describes a custom alerter that sends events to Splunk.
We’ll be following up with a future blog post to discuss these changes in more detail. In the meantime, check out Security Monkey on GitHub, join the community of users, and jump into conversation in our Gitter room if you have questions or comments.

Special Thanks

We appreciate the great community support and contributions for Security Monkey and want to specially thank:

  • Google: GCP Support in CloudAux/Security Monkey
  • Bridgewater Associates: Modularization of Watchers, Auditors, Alerters. Dozens of new watchers. Modifying the architecture to abstract the environment being monitored.

Update on HTML5 Video for Netflix

About four years ago, we shared our plans for playing premium video in HTML5, replacing Silverlight and eliminating the extra step of installing and updating browser plug-ins.  

Since then, we have launched HTML5 video on Chrome OS, Chrome, Internet Explorer, Safari, Opera, Firefox, and Edge on all supported operating systems.  And though we do not officially support Linux, Chrome playback has worked on that platform since late 2014.  Starting today, users of Firefox can also enjoy Netflix on Linux.  This marks a huge milestone for us and our partners, including Google, Microsoft, Apple, and Mozilla, who helped make it possible.

But this is just the beginning.  We launched 4K Ultra HD on Microsoft Edge in December of 2016, and look forward to high-resolution video being available on more platforms soon.  We are also looking ahead to HDR video.  Netflix-supported TVs with Chromecast built-in—which use a version of our web player—already support Dolby Vision and HDR10.  And we are working with our partners to provide similar support on other platforms over time.

Netflix adoption of HTML5 has resulted in us contributing to a number of related industry standards including:
  • MPEG-DASH, which describes our streaming file formats, including fragmented MP4 and common encryption.  
  • WebCrypto, which protects user data from inspection or tampering and allows us to provide our subscription video service on the web.  
  • Media Source Extensions (MSE), which enable our web application to dynamically manage the playback session in response to ever-changing network conditions.
  • Encrypted Media Extensions (EME), which enables playback of protected content, and hardware-acceleration on capable platforms.

We intend to remain active participants in these and other standards over time.  This includes areas that are just beginning to formulate, like the handling of HDR images and graphics in CSS being discussed in the Color on the Web community group.

Our excitement about HTML5 video has remained strong over the past four years.  Plugin-free playback that works seamlessly on all major platforms helps us deliver compelling experiences no matter how you choose to watch.  This is apparent when you venture through Stranger Things in hardware accelerated HD on Safari, or become transfixed by The Crown in Ultra HD on Edge.  And eventually, you will be able to delight in the darkest details of Marvel’s Daredevil in stunning High Dynamic Range.




The Netflix HERMES Test: Quality Subtitling at Scale


Since Netflix launched globally, the scale of our localization efforts has increased dramatically.  It’s hard to believe that just 5 years ago, we only supported English, Spanish and Portuguese.  Now we’ve surpassed 20 languages - including languages like Korean, Chinese, Arabic and Polish - and that number continues to grow.  Our desire to delight members in “their” language, while staying true to creative intent and being mindful of cultural nuances, is an important driver of quality.  It’s also fueling a need to rapidly add great talent who can help provide top-notch translations for our global members across all of these languages.


The need for localization quality at an increasing scale inspired us to build and launch HERMES, the first online subtitling and translation test and indexing system by a major content creator.  Before now, there was no standard test for media translation professionals, even though their work touches millions of people’s lives on a daily basis.  There is no common registration through a professional organization which captures the total number of professional media translators worldwide, no license numbers, accreditations, or databases for qualified professionals.  For instance, the number of working, professional Dutch subtitlers is estimated to be about 100 - 150 individuals worldwide.  We know this through market research Netflix conducted during our launch in the Netherlands several years ago, but this is a very anecdotal “guesstimate” and the actual number remains unknown to the industry.  


RESOURCING QUALITY AT SCALE


In the absence of a common registration scheme and standardized test, how do you find the best resources to do quality media translation?  Netflix does this by relying on third parties to source and manage localization efforts for our content.  But even this method often lacks the precision needed to drive constant improvement and innovation in the media translation space.  Each of these vendors recruits, qualifies and measures its subcontractors (translators) differently, so it’s nearly impossible for Netflix to maintain a standard across all of them and ensure the consistent quality, reliability and scale we need to support our constant international growth.  We can measure a vendor’s success through metrics like rejection rates, on-time rates, etc., but we can’t measure the individual.  This is like trying to win the World Cup in soccer and only being able to look at your team’s win/loss record, not knowing how many errors your players are making, blindly creating lineups without scoring averages and not having any idea how big your roster is for the next game.  It’s difficult and frustrating to try to “win” in this environment, yet this is largely how Netflix has had to operate in the localization space for the last few years, while still trying to drive improvement and quality.


THE TEST


HERMES is emblematic of Hollywood meets Silicon Valley at Netflix.  It was developed internally by the Content Localization and Media Engineering teams, in collaboration with renowned academics in the media translation space, as a five-part test for subtitlers.  The test is designed to be highly scalable and consists of thousands of randomized combinations of questions so that no two tests should be the same.  The rounds consist of multiple choice questions given at a specifically timed pace, designed to test the candidate’s ability to:


  • Understand English
  • Translate idiomatic phrases into their target language
  • Identify both linguistic and technical errors
  • Subtitle proficiently


Idioms are expressions that are oftentimes specific to a certain language (“you’re on a roll”, “he bought the farm”) and can be a tough challenge to translate into other languages. There are approximately 4,000 idioms in the English language and being able to translate them in a culturally accurate way is critical to preserving the creative intent for a piece of content.  Here’s an example from the HERMES test for translating English idioms into Norwegian:


Upon completion, Netflix will have a good idea of the candidate’s skill level and can use this information to match projects with high quality language resources.  The real long-term value of the HERMES platform is in the issuance of HERMES numbers (H-Numbers).  This unique identifier is issued to each applicant upon sign-up for the test and will stick with them for the remainder of their career supplying translation services to Netflix.  By looking at the quantity of H-Numbers in a given language, Netflix can start to more precisely estimate the size of the potential resource pool for a given language and better project the time needed to localize libraries. Starting this summer, all subtitles delivered to Netflix will be required to have a valid H-Number tied to them.  This will allow Netflix to better correlate the metrics associated with a given translation to the individual who did the work.


INNOVATION


Over time, we’ll be able to use these metrics in concert with other innovations to “recommend” the best subtitler for specific work based on their past performance to Netflix.  Much like we recommend titles to our members, we aim to match our subtitlers in a similar way.  Perhaps they consider themselves a horror aficionado, but they excel at subtitling romantic comedies - theoretically, we can make this match so they’re able to do their best quality work.  


PROGRESS


Since we unveiled our new HERMES tool two weeks ago, thousands of candidates around the world have already completed the test, covering all represented languages.  This is incredible to us because of the impact it will ultimately have on our members as we focus on continually improving the quality of the subtitles on the service.  We’re quickly approaching an inflection point where English won’t be the primary viewing experience on Netflix, and HERMES allows us to better vet the individuals doing this very important work so members can enjoy their favorite TV shows and movies in their language.

If you're a professional subtitler interested in taking the test, you can take it here.

Update: The volume of applicants has exceeded even our most optimistic expectations, so we have scaled up support resources. If you're encountering any issues taking the test, please escalate all questions and feedback to hermes.support@netflix.com and someone will be in touch soon.

By Chris Fetner and Denny Sheehan

Netflix Conductor: Inversion of Control for workflows


In December 2016, we open sourced Netflix Conductor.
We have been working hard since then to add more features, enhance the user interface for monitoring the workflows, and harden the system.  We have seen an increase in usage, new use cases, feature requests, and a lot of interest from the community.  In this post, we will talk about a couple of major features we added to Conductor this quarter.

IoC in Conductor

Conductor makes it possible to reuse workflows and embed them as sub workflows.  Sub workflows are a great way to promote process reuse and modularity.  As Netflix Conductor adoption increased, we observed interesting use cases where business processes were initiated based on events and state changes in other workflows.


Traditionally, applications implemented such use cases with pub/sub systems, publishing events as application state changed and subscribing to events of interest in order to initiate actions.


We sought a solution with the following characteristics:


  1. Loose coupling between workflows
  2. Allow starting a workflow based on state changes in other workflows
  3. Provide integration with external systems producing / consuming events via SQS/SNS


Enter Inversion of Control - for workflows.  The idea is to be able to chain workflow actions such as starting a workflow, completing a task in a workflow, etc. based on state changes in other workflows.


To illustrate this further, let’s take a look at two solutions to trigger a “QC Workflow” after a file has been ingested.  Once a file has been ingested, multiple workflows are triggered (owned by separate applications); one of them is  a file verification workflow (QC Workflow) that is triggered to identify the type of the file (Audio, Video etc.), run appropriate checks, and start the encoding process for the file.  


With Conductor, there are two separate ways to achieve this; both solutions get the job done, but there are subtle differences between the two approaches.

Starting another workflow as sub workflow



Given that sub workflows are tasks, their output can be consumed by other tasks in the File Ingest workflow subsequent to the QC workflow, e.g. outcome of the QC process.  Sub workflows thus are useful when there is a tight dependency and coupling between business processes.

Starting a workflow on an event




In the above approach, the File Ingest workflow produces an “Ingest Complete” event.  This event can be consumed by one or more event handlers to execute actions - including starting of a workflow.


  • File Ingest workflow does not have any direct workflow dependencies
  • Workflows can be triggered based on events promoting loose coupling
  • The outcome of QC Task workflow does not impact the File Ingest workflow
  • Multiple workflows or actions can be executed per each event


EVENT Task

We introduced a new type of task called EVENT.  An event task can be added to the workflow definition.  When executed, it produces an event in the specified “sink”.  A sink is an eventing system such as SQS, Conductor, or any other supported systems.  Sinks follow a pluggable architecture where systems such as JMS, Kafka etc. can be added by implementing the required interfaces and making the implementation available in the classpath of the Conductor server JVM.


Following the task input / output model of Conductor, the input to the EVENT task is calculated and sent as the event payload to the sink.


Below is an example of an event task that publishes an event to an SQS queue identified by name.


{
   "name": "example_event",
   "taskReferenceName": "event0",
   "input": {
       "filename": "${workflow.input.filename}",
       "location": "${anothertask.output.location}"
   },
   "type": "EVENT",
   "sink": "sqs:sqs_queue_name"
}


Event Handlers

Event handlers are listeners that are executed upon arrival of a specific event.  Handlers listen to various event sources, including Conductor and SQS/SNS, and users can plug in other sources via APIs. Conductor supports multiple event handlers per event type, so it is possible to have multiple event handlers for a single event source.  Event handlers are managed in Conductor using the /event API endpoints.


Event Condition

Condition is a JavaScript expression on the payload, which MUST evaluate to true for the event handler to execute the actions.  The condition acts as a filter to selectively take actions on the subset of events matching the criteria.  Conditional filters are optional; when not specified, all the events from the source are processed.

Event Actions

Each Event handler has one or more actions associated with it.  When the condition associated with an event is evaluated to “true”, the actions are executed.  The supported actions are:


  • Start a workflow
  • Complete a task as failed or success


Example

Below is a sample event handler that listens to an SQS queue and starts a workflow based on the condition.


{
   "name": "conductor_event_test",
   "event": "sqs:example_sqs_queue_name",
   "condition": "$.file_type == 'image'",
   "actions": [
       {
           "action": "start_workflow",
           "start_workflow": {
               "name": "workflow_name",
               "input": {
                   "file_name": "${filename}",
                   "file_type": "${file_type}"
               }
           }
       }
   ],
   "active": true
}
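

A handler like the one above is registered with the Conductor server through the /event API endpoints mentioned earlier. As a minimal sketch, assuming a Conductor server reachable at localhost:8080 and an /api/event path (adjust both for your deployment), registration from Python might look like this:

import requests

# Minimal sketch: register the sample event handler above with a Conductor
# server. The host, port, and "/api/event" path are assumptions about the
# deployment; the handler body itself mirrors the example in this post.
handler = {
    "name": "conductor_event_test",
    "event": "sqs:example_sqs_queue_name",
    "condition": "$.file_type == 'image'",
    "actions": [
        {
            "action": "start_workflow",
            "start_workflow": {
                "name": "workflow_name",
                "input": {"file_name": "${filename}", "file_type": "${file_type}"},
            },
        }
    ],
    "active": True,
}

response = requests.post("http://localhost:8080/api/event", json=handler)
response.raise_for_status()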


Conductor UI

All the registered event handlers can be inspected in the UI. Updates can be made via REST APIs or through the swagger API page.

JSON Data Transformation

Conductor now supports JSONPath-based data transformation.  Using JSONPath in a task's input configuration allows complex data transformations and reduces the need to write one-off tasks that would otherwise do the transformation.

Path forward

While it’s been a busy quarter, we are not done yet.  Looking forward to Q2, our focus is going to be on making it easier for developers to test their workflows, and on logging task execution context to help understand and troubleshoot workflows spanning multiple workers and applications.


If you like the challenges of building distributed systems and are interested in building the Netflix studio ecosystem and the content pipeline at scale, check out our job openings.




BetterTLS - A Name Constraints test suite for HTTPS clients


Written by Ian Haken

At Netflix we run a microservices architecture that has hundreds of independent applications running throughout our ecosystem. One of our goals, in the interest of implementing security in depth, is to have end-to-end encrypted, authenticated communication between all of our services wherever possible, regardless of whether or not it travels over the public internet. Most of the time, this means using TLS, an industry standard implemented in dozens of languages. However, this means that every application in our environment needs a TLS certificate.

Bootstrapping the identity of our applications is a problem we have solved, but most of our applications are resolved using internal names or are directly referenced by their IP (which lives in a private IP space). Public Certificate Authorities (CAs) are specifically restricted from issuing certificates of this type (see section 7.1.4.2.1 of the CA/B baseline requirements), so it made sense to use an internal CA for this purpose. As we convert applications to use TLS (e.g., by using HTTPS instead of HTTP) it was reasonably straightforward to configure them to use a truststore which includes this internal CA. However, the question remained of what to do about users accessing their services using a browser. Our internal CA isn’t trusted by browsers out-of-the-box, so what should we do?

The most obvious answer is straightforward: “add the CA to browsers’ truststores.” But we were hesitant about this solution. By forcing our users to trust a private CA, they must take on faith that this CA is only used to mint certificates for internal services and is not being used to man-in-the-middle traffic to external services (such as banks, social media sites, etc). Even if our users do take our good behavior on faith, the impact of a compromise to our infrastructure becomes significant; not only could an attacker compromise our internal traffic channels, but all of our employees are suddenly at risk, even when they’re at home.
Fortunately, the often underutilized Name Constraints extension provides us a solution to both of these concerns.

The Name Constraints Extension

One powerful (but often neglected) feature of the X.509 certificate specification is the Name Constraints extension. This is an extension that can be put on CA certificates which whitelists and/or blacklists the domains and IPs for which that CA or any sub-CAs are allowed to create certificates. For example, suppose you trust the Acme Corp Root CA, which delegates to various other sub-CAs that ultimately sign certificates for websites. They may have a certificate hierarchy that looks like this:

Now suppose that Beta Corp and Acme Corp become partners and need to start trusting each other’s services. Similar to Acme Corp, Beta Corp has a root CA that has signed certificates for all of its services. Therefore, services inside Acme Corp need to trust the Beta Corp root CA. Rather than update every service in Acme Corp to include the new root CA in its truststore, a simpler solution is for Acme Corp to cross-certify with Beta Corp so that the Beta Corp root CA has a certificate signed by the Acme Corp Root CA. For users inside Acme Corp, their trust hierarchy now looks like this.

However, this has the undesirable side effect of exposing users inside of Acme Corp to the risk of a security incident inside Beta Corp. If a Beta Corp CA is misused or compromised, it could issue certificates for any domain, including those of Acme Corp.

This is where the Name Constraints extension can play a role. When Acme Corp signs the Beta Corp root CA certificate, it can include an extension in the certificate which declares that it should only be trusted to issue certificates under the “betacorp.com” domain. This way Acme Corp users would not trust mis-issued certificates for the “acmecorp.com” domain from CAs under the Beta Corp root CA.

This example demonstrates how Name Constraints can be useful in the context of CA cross-certification, but it also applies to our original problem of inserting an internal CA into browsers’ trust stores. By minting the root CA with Name Constraints, we can limit what websites could be verified using that trust root, even if the CA or any of its intermediaries were misused.
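
To make this concrete, here is a minimal sketch using the Python cryptography library (not something from our tooling or from BetterTLS itself) of how a cross-signed CA certificate can carry a Name Constraints extension restricting it to betacorp.com. The keys, names, and validity period are purely illustrative.

import datetime
from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

# Illustrative keys; in practice the Beta Corp key already exists and only
# its public half is presented to Acme Corp for cross-signing.
acme_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
beta_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

cross_cert = (
    x509.CertificateBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, u"Beta Corp Root CA")]))
    .issuer_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, u"Acme Corp Root CA")]))
    .public_key(beta_key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(datetime.datetime.utcnow())
    .not_valid_after(datetime.datetime.utcnow() + datetime.timedelta(days=365))
    .add_extension(x509.BasicConstraints(ca=True, path_length=None), critical=True)
    # The constraint: anything chaining through this CA is only trusted for
    # names under betacorp.com. Enforcement is entirely up to the TLS client.
    .add_extension(
        x509.NameConstraints(
            permitted_subtrees=[x509.DNSName(u"betacorp.com")],
            excluded_subtrees=None,
        ),
        critical=True,
    )
    .sign(acme_key, hashes.SHA256())
)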

At least, that’s how Name Constraints should work.

The Trouble with Name Constraints

The Name Constraints extension lives on the certificate of a CA but can’t actually constrain what a bad actor does with that CA’s private key (much less control what a subordinate CA issues), so even with the extension present there is nothing to stop the bad actor from signing a certificate which violates the constraint. Therefore, it is up to the TLS client to verify that all constraints are satisfied whenever the client verifies a certificate chain.

This means that for the Name Constraints extension to be useful, HTTPS clients (and browsers in particular) must enforce the constraints properly.

Before relying on this solution to protect our users, we wanted to make sure browsers were really implementing Name Constraints verification and doing so correctly. The initial results were promising: each of the browsers we tested (Chrome, Firefox, Edge, and Safari) gave verification exceptions when browsing to a site where a CA signed a certificate in violation of the constraints.

However, as we extended our test suite beyond basic tests we rapidly began to lose confidence. We created a battery of test certificates which moved the subject name between the certificate’s subject common name and Subject Alternative Name extension, which mixed the use of Name Constraint whitelisting and blacklisting, and which used both DNS names and IP names in the constraint. The result was that every browser (except for Firefox, which showed a 100% pass rate) and every HTTPS client (such as Java, Node.JS, and Python) allowed some sort of Name Constraint bypass.

Introducing BetterTLS

In order to raise awareness around the issues we discovered and encourage TLS implementers to correct them, and to allow them to include some of these tests in their own test suite, we are open sourcing the test suite we created and making it available online. Inspired by badssl.com, we created bettertls.com with the hope that the tests we add to this site can help improve the resiliency of TLS implementations.

Before we made bettertls.com public, we reached out to many of the affected vendors and are happy to say that we received a number of positive responses. We’d particularly like to thank Ryan Sleevi and Adam Langley from Google who were extremely responsive and immediately took actions to remediate some of the discovered issues and incorporate some of these test certificates into their own test suite. We have also received confirmation from Oracle that they will be addressing the results of this test suite in Java in an upcoming security release.

The source for bettertls.com is available on github, and we welcome suggestions, improvements, corrections, and additional tests!

Introducing Bolt: On Instance Diagnostic and Remediation Platform

Last August we introduced Winston, our event driven diagnostic and remediation platform. Winston helps orchestrate diagnostic and remediation actions from the outside. As part of that orchestration, there are multiple actions that need to be performed at the AWS instance (VM) level to collect data or take mitigation steps. We would like to discuss a supporting service called Bolt that helps with instance-level action executions. By `action`, we refer to a runbook, script, or piece of automation code.

Problem space

Netflix does not run its own data centers. We use AWS services for all our infrastructure needs. While we do utilize value-added services from AWS (SQS, S3, etc.), much of our AWS usage sits on top of Amazon's core compute service, EC2.

As part of operationalizing our fleet of EC2 instances, we needed a simple way of automating common diagnostic and remediation tasks on these instances. This solution could integrate with our orchestration services like Winston to run these actions across multiple instances, making it easy to collect data and aggregate at scale as needed. This solution became especially important for our instances hosting persistent services like Cassandra, which are longer lived than our non-persistent services. For example, when a Cassandra node is running low on disk space, a Bolt action would be executed to analyze whether any stale snapshots are lying around and, if so, reclaim disk space without user intervention.

Bolt to the rescue

Bolt is an instance-level daemon that runs on every instance and exposes custom actions as REST commands for execution. Developers provide their own custom scripts to Bolt as actions, which are version controlled and then seamlessly deployed across all relevant EC2 instances dynamically. Once deployed, these actions are available for execution via REST endpoints through Bolt.


Some interesting design decisions when building Bolt were
  • There is no central instance catalog in Bolt. The cloud is ever changing and ephemeral, and maintaining this dynamic list is not our core value add. Instead, the Bolt client gets this list on demand from Spinnaker. This allows a clear separation of concerns between the systems that provide the infrastructure map and Bolt, which uses that map to act.
  • There is no middle-tier service; Bolt client libraries and CLI tools talk directly to the instances. This allows Bolt to scale to thousands of instances and reduces operational complexity.
  • Execution history stays on the instance rather than in a centralized external store. By decentralizing the history store and making it instance scoped and tied to the life of that instance, we reduce operational cost and simplify scalability, at the cost of losing the history when the instance is lost.
  • Lean resource usage: the agent only uses 50 MB of memory and is niced to prevent it from taking over the CPU.
Below, we go over some of the more interesting features of Bolt.

Language support

The main languages used for writing Bolt actions are Python and Bash/Shell. Actually, any scripting language that is installed on the instance can be used (Groovy, Perl, …), as long as a proper shebang is specified.


The advantage of using Python is that we provide per-pack virtual environment support, which gives dependency isolation to the automation. We also provide, through Winston Studio, self-serve dependency management using the standard requirements.txt approach.
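
To make this concrete, here is a hypothetical Bolt action written as an ordinary Python script with a shebang, along the lines of the Cassandra disk-space example mentioned earlier. The paths and threshold are made up for illustration; they are not what Netflix actually runs.

#!/usr/bin/env python3
"""Hypothetical Bolt action: report disk usage and any Cassandra snapshot
directories that could be cleared to reclaim space. Paths and the usage
threshold are illustrative only."""
import os
import shutil

DATA_DIR = "/var/lib/cassandra/data"
USAGE_THRESHOLD = 0.80  # warn when the data volume is more than 80% full


def find_snapshot_dirs(root):
    for dirpath, _dirnames, _files in os.walk(root):
        if os.path.basename(dirpath) == "snapshots":
            yield dirpath


def main():
    usage = shutil.disk_usage(DATA_DIR)
    fraction_used = usage.used / usage.total
    print("disk usage: %.1f%% of %d GB" % (fraction_used * 100, usage.total // 2**30))

    if fraction_used > USAGE_THRESHOLD:
        print("over threshold; stale snapshots that could be removed:")
        for snapshot_dir in find_snapshot_dirs(DATA_DIR):
            print("  " + snapshot_dir)


if __name__ == "__main__":
    main()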

Self serve interface

We chose to extend Winston Studio to support CRUD & Execute operations for both Winston and Bolt. By providing a simple and intuitive interface for uploading Bolt actions, looking at execution history, and experimenting and executing on demand, we ensured that the Fault Detection Engineering team is not the bottleneck for operations associated with Bolt actions.


Here is a screenshot of what a single Bolt action looks like. Users (Netflix engineers) can iterate, introspect and manage all aspects of their action via this studio. This studio also implements and exposes the paved path for action deployment to make it easy for engineers to do the right thing to mitigate risks.




Users can also look at the previous executions and individual execution details through Winston Studio as shown in the following snapshots.




Here is an example requirements.txt file which specifies the pack dependencies (for Python Bolt actions):
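
A representative file might look like the following (the packages and versions are purely illustrative, not an actual Netflix pack):

requests==2.12.4
boto3==1.4.4
psutil==5.0.1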

Action lifecycle management

Similar to Winston, we help ensure that these actions are version controlled correctly and we enforce staged deployment. Here are the 3 stages:
  1. Dev: At this stage, a Bolt action can only be executed from the Studio UI. It is only deployed at runtime on the instance where it will be executed. This stage is used for development purposes, as the name suggests.
  2. Test: When development is completed, the user promotes the action to the Test environment. At this stage, within ~5 minutes, the action will be deployed on all relevant Test EC2 instances. The action will then sit at this stage for a couple of hours/days/weeks (at the discretion of the user) to catch edge cases and unexpected issues at scale.
  3. Prod: When the user is comfortable with the stability of the action, the next step is to promote it to Prod. Again, within ~5 minutes, the action will be deployed on all relevant Prod EC2 instances.

Security

Even though this is an internal administrative service, security was a big aspect of building software that installs on all EC2 instances we run. Here are some key security features and decisions we took to harden the service against bad actors:
  • Whitelisted actions. We do not allow running arbitrary commands on instances. Developers explicitly choose to expose a command/script on their service instances via Bolt.
  • Auditing - All CRUD changes to actions are authenticated and audited in the studio. Netflix engineers have to use their credentials to be able to make any change to the whitelisted actions.
  • Mutual TLS authenticated REST endpoints: Arbitrary systems cannot invoke executions via Bolt on instances.

Async execution

Choosing an async API for executing actions allows long-running diagnostic or remediation actions to run without blocking the client. It also allows clients to scale to managing thousands of instances in a short interval of time through a fire-and-check-back-later interface.
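
As a rough sketch of that fire-and-check-back-later pattern, the snippet below starts an action on one instance's Bolt daemon and then polls for the result. The port, URL paths, payload fields, and certificate paths are all assumptions made for illustration; they are not Bolt's actual REST contract.

import time
import requests

# Hypothetical sketch of the fire-and-check-back-later pattern against a
# single instance's Bolt daemon. Endpoints and fields are assumptions for
# illustration; production calls are mutual-TLS authenticated.
BOLT = "https://10.0.0.12:7777"
CLIENT_CERT = ("/etc/ssl/client.pem", "/etc/ssl/client.key")
INTERNAL_CA = "/etc/ssl/internal-ca.pem"

# Kick off an action; the call returns immediately with an execution id.
resp = requests.post(
    f"{BOLT}/actions/disk_usage/executions",
    json={"args": {"path": "/var/lib/cassandra/data"}},
    cert=CLIENT_CERT,
    verify=INTERNAL_CA,
)
execution_id = resp.json()["execution_id"]

# Check back later: poll until the action reports a terminal state.
while True:
    status = requests.get(
        f"{BOLT}/executions/{execution_id}",
        cert=CLIENT_CERT,
        verify=INTERNAL_CA,
    ).json()
    if status["state"] in ("SUCCEEDED", "FAILED", "TIMED_OUT"):
        print(status["state"], status.get("stdout", ""))
        break
    time.sleep(5)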


Here is a simple sequence diagram to illustrate the flow of an action being executed on a single EC2 instance:

Client libraries

The Bolt ecosystem includes both Python and Java client libraries for integrations. These client libraries make TLS-authenticated calls available out of the box and also implement common client-side orchestration patterns.


What do we mean by orchestration? Let's say that you want to restart Tomcat on 100 instances. You probably don’t want to do it on all instances at the same time, as your service would experience downtime. This is where the serial/parallel nature of an execution comes into play. We support running an action one instance at a time, one EC2 Availability Zone at a time, one EC2 Region at a time, or on all instances in parallel (if this is a read-only action, for example). The user decides which strategy applies to their action (the default is one instance at a time, just to be safe).

Resiliency features

  • Bolt engine and all action executions are run at a nice level of 1 to prevent them from taking over the host
  • Health check: Bolt exposes a health check API, and we have logic that periodically checks the health of the Bolt service across all instances. We also use it for our deployments and regression testing.
  • Metrics monitoring: CPU/memory usage of the engine and running actions
  • Staged/controlled Deployments of the engine and the actions
  • No disruption during upgrades: Running actions won’t be interrupted by Bolt upgrades
  • No middle tier: better scalability and reliability

Other features

Other notable features include support for action timeouts and support for killing running actions. Also, the user can see the output (stdout/stderr) while the action is running, without having to wait for the action to complete.

Use cases

While Bolt is very flexible as to what action it can perform, the majority of use cases fall into these patterns:

Diagnostic

Bolt is used as a diagnostic tool to help engineers when their system is not behaving as expected. Also, some teams use it to gather metadata about their instances (disk usage, versions of installed packages, JDK version, …). Others use Bolt to get detailed disk usage information when a low disk space alert is triggered.

Remediation

Remediation is an approach to fix an underlying condition that is rendering a system non-optimal. For example, Bolt is used by our Data Pipeline team to restart the Kafka broker when needed. It is also used by our Cassandra team to free up disk space.

Proactive maintenance

In the case of our Cassandra instances, Bolt is used for proactive maintenance (repairs, compactions, Datastax & Priam binary upgrades, …). It is also used for binary upgrades on our Dynomite instances (for Florida and Redis).

Some usage numbers

Here are some usage statistics:

  • Thousands of action executions per day
  • Bolt is installed on tens of thousands of instances

Architecture

Below is a simplified diagram of how Bolt actions are deployed on the instances:
  1. When the user modifies an action from Winston Studio, it is committed to a Git version control system (for auditing purposes; Stash in our case) and synced to a blob store (an AWS S3 bucket in our case)
  2. Every couple of minutes, all the Bolt-enabled Netflix instances check whether a new version is available for each installed pack and proceed with the upgrade if needed (a sketch of this check follows the list)
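As a rough sketch (the bucket name, key layout, and version-file format below are assumptions, not the actual pack layout), the periodic check could look like this:

  # Sketch of the periodic pack version check against the blob store.
  import json
  import boto3

  s3 = boto3.client("s3")
  BUCKET = "bolt-packs"   # hypothetical S3 bucket

  def check_and_upgrade(installed_packs):
      """installed_packs: dict mapping pack name -> locally installed version."""
      for pack, local_version in installed_packs.items():
          manifest = json.loads(
              s3.get_object(Bucket=BUCKET, Key=f"{pack}/latest.json")["Body"].read()
          )
          if manifest["version"] != local_version:
              download_and_install(pack, manifest["version"])

  def download_and_install(pack, version):
      # Hypothetical helper: fetch the pack archive from S3 and swap it in atomically.
      ...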


The diagram also shows how actions are triggered (either manually from Winston Studio or in response to an event from Winston).


The diagram also summarizes how Bolt is stored on each instance:
  • The Bolt engine binary, the Bolt packs and the Python virtual environments are installed on the root device
  • Bolt log files, action stdout/stderr output and the SQLite3 database that contains the execution history are stored on the ephemeral device. We keep about 180 days of execution history and compress the stdout/stderr a day after the execution is completed, to save space.


It also shows that we send metrics to Atlas to track usage as well as pack upgrade failures. This is critical to ensure that we get notified of any issues.



Build vs. Buy

Here are some alternatives we looked at before building Bolt.

SSH

Before Bolt, we were using plain and simple SSH to connect to instances and execute actions as needed. While built into every VM and very flexible, this model had some critical issues:


  1. Security: Winston, or any other orchestrator, needs to be a bastion holding the keys to our kingdom, something we were not comfortable with
  2. Async vs. sync: Running SSH commands meant we could only run them in a synchronous manner, blocking a process on the client side to wait for the action to finish. This sync approach has scalability concerns as well as reliability issues (long-running actions were sometimes interrupted by network issues)
  3. History: Persisting what was run for future diagnostics
  4. Admin interface


We discussed the idea of using separate sudo rules, a dedicated set of keys and some form of binary whitelisting to improve the security aspect. This would have alleviated our security concerns, but would have limited our agility to add custom scripts (which could contain non-whitelisted binaries).


In the end, we decided to invest in building technology that supports the instance-level use case, giving us the flexibility to add value-added features while addressing the security, reliability and performance challenges we had.

AWS EC2 Run

AWS EC2 Run is a great option if you need to run remote scripts on AWS EC2 instances. But in October 2014 (when Bolt was created), EC2 Run was either not created yet or not public (its introduction post was published in October 2015). Since then, we have synced quarterly with the EC2 Run team from AWS to see how we can leverage their agent to replace the Bolt agent. But, at the time of writing, given the Netflix-specific features of Bolt and its integration with our Atlas monitoring system, we have decided not to migrate yet.

Chef/Puppet/Ansible/Salt/RunDeck ...

These are great tools for configuration management/orchestration. Sadly, some of them (Ansible/RunDeck) depended on SSH, which was an issue for the reasons listed above. For Chef and Puppet, we assessed that they were not quite the right tools for what we were trying to achieve (being mostly configuration management tools). As for Salt, adding a middle tier (Salt Master) meant an increased risk of downtime, even with a multi-master approach. Bolt doesn't have any middle tier or master, which significantly reduces the risk of downtime and allows better scalability.

Future work

Below is a glimpse of some future work we plan to do in this space.

Resources Capping

As an engineer, one of the biggest concerns when running an extra agent/process is: "Could it take over my system's memory/CPU?" In this regard, we plan to invest in memory/CPU capping as an optional configuration in Bolt. We already reduce the processing priority of Bolt and the action processes with 'nice', but memory is currently unconstrained. The current plan is to use cgroups.
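As a minimal sketch of what cgroup-based capping could look like (this is planned work, not what Bolt does today; the cgroup name and limit are illustrative, and writing these files requires root on a cgroup v1 host):

  # Cap memory for the Bolt engine and its child action processes via cgroup v1.
  import os

  CGROUP = "/sys/fs/cgroup/memory/bolt"      # hypothetical cgroup for Bolt
  LIMIT_BYTES = 512 * 1024 * 1024            # e.g., a 512 MB cap

  os.makedirs(CGROUP, exist_ok=True)
  with open(os.path.join(CGROUP, "memory.limit_in_bytes"), "w") as f:
      f.write(str(LIMIT_BYTES))

  # Move the current process into the group; forked action subprocesses inherit it.
  with open(os.path.join(CGROUP, "tasks"), "w") as f:
      f.write(str(os.getpid()))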

Integrate with Spinnaker

The Spinnaker team is planning on using Bolt to gather metadata about running instances, as well as for other general use cases that apply to all Netflix microservices (restart Tomcat, …). For this reason, we are planning to add Bolt to our BaseAMI, which will automatically make it available on all Netflix microservices.

Summary

From its creation to solve specific needs for our Cloud Database Engineering group, through its broader adoption, and soon its inclusion on all Netflix EC2 instances, Bolt has evolved into a reliable and effective tool.


By: Jean-Sebastien Jeannotte and Vinay Shah
on behalf of the Fault Detection Engineering team

The Evolution of Container Usage at Netflix

Containers are already adding value to our proven globally available cloud platform based on Amazon EC2 virtual machines. We've shared pieces of Netflix's container story in the past (video, slides), but this blog post will discuss containers at Netflix in depth. As part of this story, we will cover Titus: Netflix's infrastructural foundation for container-based applications. Titus provides Netflix-scale cluster and resource management as well as container execution with deep Amazon EC2 integration and common Netflix infrastructure enablement.


This month marks two major milestones for containers at Netflix.  First, we have achieved a new level of scale, crossing one million containers launched per week.  Second, Titus now supports services that are part of our streaming service customer experience.  We will dive deeper into what we have done with Docker containers as well as what makes our container runtime unique.

History of Container Growth

Amazon's virtual machine based infrastructure (EC2) has been a powerful enabler of innovation at Netflix. In addition to virtual machines, we've also chosen to invest in container-based workloads for a few unique values they provide. The benefits, excitement and explosive usage growth of containers from our developers have surprised even us.


While EC2 supported advanced scheduling for services, this didn't help our batch users. At Netflix, a significant set of users run jobs on a time or event based trigger to analyze data, perform computations and then emit results to Netflix services, users and reports. We run workloads such as machine learning model training, media encoding, continuous integration testing, big data notebooks and CDN deployment analysis jobs many times each day. We wanted to provide a common resource scheduler for container based applications, independent of workload type, that could be controlled by higher level workflow schedulers. Titus serves as a combination of a common deployment unit (Docker image) and a generic batch job scheduling system. The introduction of Titus has helped Netflix expand to support the growing batch use cases.


With Titus, our batch users are able to put together sophisticated infrastructure quickly because they only have to specify resource requirements. Users no longer have to deal with choosing and maintaining AWS EC2 instance sizes that don't always perfectly fit their workload. Users trust Titus to pack larger instances efficiently across many workloads. Batch users develop code locally and then immediately schedule it for scaled execution on Titus. Using containers, Titus runs any batch application, letting the user specify exactly what application code and dependencies are needed. For example, in machine learning training we have users running a mix of Python, R, Java and bash script applications.


Beyond batch, we saw an opportunity to bring the benefits of simpler resource management and a local development experience for other workloads.  In working with our Edge, UI and device engineering teams, we realized that service users were the next audience.  Today, we are in the process of rebuilding how we deploy device-specific server-side logic to our API tier leveraging single core optimized NodeJS servers.  Our UI and device engineers wanted a better development experience, including a simpler local test environment that was consistent with the production deployment.


In addition to a consistent environment, with containers developers can push new application versions faster than before by leveraging Docker layered images and pre-provisioned virtual machines ready for container deployments.  Deployments using Titus now can be done in one to two minutes versus the tens of minutes we grew accustomed to with virtual machines.  


The theme that underlies all these improvements is developer innovation velocity.  Both batch and service users can now experiment locally and test more quickly.  They can also deploy to production with greater confidence than before.  This velocity drives how fast features can be delivered to Netflix customers and therefore is a key reason why containers are so important to our business.

Titus Details

We have already covered what led us to build Titus. Now, let's dig into the details of how Titus provides these values. We will provide a brief overview of how Titus scheduling and container execution support the service and batch job requirements, as shown in the diagram below.




Titus handles the scheduling of applications by matching required resources and available compute resources.  Titus supports both service jobs that run “forever” and batch jobs that run “until done”.  Service jobs restart failed instances and are autoscaled to maintain a changing level of load.  Batch jobs are retried according to policy and run to completion.  


Titus offers multiple SLAs for resource scheduling. Titus offers on-demand capacity for ad hoc batch and non-critical internal services by autoscaling capacity in EC2 based on current needs. Titus also offers pre-provisioned, guaranteed capacity for user-facing workloads and more critical batch jobs. The scheduler does both bin packing for efficiency across larger virtual machines and anti-affinity for reliability spanning virtual machines and availability zones. The foundation of this scheduling is a Netflix open source library called Fenzo.
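The real scheduler is built on Fenzo (a Java library) and handles far more constraints than we can show here. Purely as a conceptual illustration of the two ideas above, packing work onto already-busy hosts and spreading a job across zones, here is a toy sketch (this is not Fenzo's API):

  # Toy illustration of bin packing plus zone anti-affinity (not Fenzo's API).
  def schedule(task_cpu, hosts, placed_zones):
      """hosts: list of dicts with 'free_cpu' and 'zone'.
      placed_zones: zones already used by this job's other tasks."""
      candidates = [h for h in hosts if h["free_cpu"] >= task_cpu]
      if not candidates:
          return None   # nothing fits; scale up EC2 capacity or queue the task
      # Prefer zones the job is not in yet (anti-affinity), then the most-loaded host
      # that still fits (bin packing), keeping whole machines free for large tasks.
      candidates.sort(key=lambda h: (h["zone"] in placed_zones, h["free_cpu"]))
      chosen = candidates[0]
      chosen["free_cpu"] -= task_cpu
      return chosen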


Titus's container execution, which runs on top of EC2 VMs, integrates with both AWS and Netflix infrastructure. We expect users to use both virtual machines and containers for a long time to come, so we decided that we wanted the cloud platform and operational experiences to be as similar as possible. In using AWS, we chose to deeply leverage existing EC2 services. We used Virtual Private Cloud (VPC) for routable IPs rather than a separate network overlay. We leveraged Elastic Network Interfaces (ENIs) to ensure that all containers had application specific security groups. Titus provides a metadata proxy that enables containers to get a container specific view of their environment as well as IAM credentials. Containers do not see the host's metadata (e.g., IP, hostname, instance-id). We implemented multi-tenant isolation (CPU, memory, disk, networking and security) using a combination of Linux, Docker and our own isolation technology.


For containers to be successful at Netflix, we needed to integrate them seamlessly into our existing developer tools and operational infrastructure.  For example, Netflix already had a solution for continuous delivery – Spinnaker.  While it might have been possible to implement rolling updates and other CI/CD concepts in our scheduler, delegating this feature set to Spinnaker allowed for our users to have a consistent deployment tool across both virtual machines and containers.  Another example is service to service communication.  We avoided reimplementing service discovery and service load balancing.  Instead we provided a full IP stack enabling containers to work with existing Netflix service discovery and DNS (Route 53) based load balancing.   In each of these examples, a key to the success of Titus was deciding what Titus would not do, leveraging the full value other infrastructure teams provide.


Using existing systems comes at the cost of augmenting these systems to work with containers in addition to virtual machines.  Beyond the examples above, we had to augment our telemetry, performance autotuning, healthcheck systems, chaos automation, traffic control, regional failover support, secret management and interactive system access.  An additional cost is that tying into each of these Netflix systems has also made it difficult to leverage other open source container solutions that provide more than the container runtime platform.


Running a container platform at our level of scale (with this diversity of workloads) requires a significant focus on reliability.  It also uncovers challenges in all layers of the system.  We’ve dealt with scalability and reliability issues in the Titus specific software as well as the open source we depend on (Docker Engine, Docker Distribution, Apache Mesos, Snap and Linux).  We design for failure at all levels of our system including reconciliation to drive consistency between distributed state that exists between our resource management layer and the container runtime.  By measuring clear service level objectives (container launch start latency, percentage of containers that crash due to issues in Titus, and overall system API availability) we have learned to balance our investment between reliability and functionality.


A key part of how containers help engineers become more productive is through developer tools. The developer productivity tools team built a local development tool called Newt (Netflix Workflow Toolkit). Newt helps simplify container development both iteratively locally and through Titus onboarding. Having a consistent container environment between Newt and Titus helps developers deploy with confidence.

Current Titus Usage

We run several Titus stacks across multiple test and production accounts across the three Amazon regions that power the Netflix service.


When we started Titus in December of 2015, we launched a few thousand containers per week across a handful of workloads. Last week, we launched over one million containers. These containers represented hundreds of workloads. This 1000X increase in container usage happened over a one-year timeframe, and growth doesn't look to be slowing down.


We run a peak of 500 r3.8xl instances in support of our batch users.  That represents 16,000 cores of compute with 120 TB of memory.  We also added support for GPUs as a resource type using p2.8xl instances to power deep learning with neural nets and mini-batch.


In the early part of 2017, our stream-processing-as-a-service team decided to leverage Titus to enable simpler and faster cluster management for their Flink based system.  This usage has resulted in over 10,000 service job containers that are long running and re-deployed as stream processing jobs are changed.  These and other services use thousands of m4.4xl instances.


While the above use cases are critical to our business, issues with these containers do not impact Netflix customers immediately.  That has changed as Titus containers recently started running services that satisfy Netflix customer requests.


Supporting customer facing services is not a challenge to be taken lightly.  We’ve spent the last six months duplicating live traffic between virtual machines and containers.  We used this duplicated traffic to learn how to operate the containers and validate our production readiness checklists.  This diligence gave us the confidence to move forward making such a large change in our infrastructure.

The Titus Team

One of the key aspects of Titus's success at Netflix has been the experience and growth of the Titus development team. Our container users trust the team to keep Titus operational and innovating with their needs.


We are not done growing the team yet. We are looking to expand the container runtime as well as our developer experience. If working on container focused infrastructure excites you and you'd like to be part of the future of Titus, check out our jobs page.



On behalf of the entire Titus development team

Towards true continuous integration: distributed repositories and dependencies

For the past 8 years, Netflix has been building and evolving a robust microservice architecture in AWS. Throughout this evolution, we learned how to build reliable, performant services in AWS. Our microservice architecture decouples engineering teams from each other, allowing them to build, test and deploy their services as often as they want. This flexibility enables teams to maximize their delivery velocity. Velocity and reliability are paramount design considerations for any solution at Netflix.
As supported by our architecture, microservices provide their consumers with a client library that handles all of the IPC logic. This provides a number of benefits to both the service owner and the consumers. In addition to the consumption of client libraries, the majority of microservices are built on top of our runtime platform framework, which is composed of internal and open source libraries.
While service teams do have the flexibility to release as they please, their velocity can often be hampered by updates to any of the libraries they depend on. An upcoming product feature may require a number of microservices to pick up the latest version of a shared library or client library. Updating dependency versions carries risk.
Or put simply, managing dependencies is hard.
Updating your project’s dependencies could mean a number of potential issues, including:
  • Breaking API changes - This is the best case scenario. A compilation failure that breaks your build. Semantic versioning combined with dependency locking and dynamic version selectors should be sufficient for most teams to prevent this from happening, assuming cultural rigor around semver. However, locking yourself into a major version makes it that much harder to upgrade the company’s codebase, leading to prolonged maintenance of older libraries and configuration drift.
  • Transitive dependency updates - Due to the JVM's flat classpath, only a single version of a class can exist within an application. Build tools like Gradle and Maven handle version conflict resolution, preventing multiple versions of the same library from being included. This also means that there is now code within your application that is running with a transitive dependency version that it has never been tested against.
  • Breaking functional changes - Welcome to the world of software development! Ideally this is mitigated by proper testing. Ideally, library owners are able to run their consumer’s contract tests to understand the functionality that is expected of them.
To address the challenges of managing dependencies at scale, we have observed companies moving towards two approaches: Share little and monorepos.
  • The share little approach (or don't use shared libraries) has been recently popularized by the broader microservice movement. The share little approach states that no code should be shared between microservices. Services should only be coupled via their HTTP APIs. Some recommendations even go as far as to say that copy and paste is preferable to shared libraries. This is the most extreme approach to decoupling.
  • The monorepo approach dictates that all source code for the organization live in a single source repository. Any code change should be compiled/tested against everything in the repository before being pushed to HEAD. There are no versions of internal libraries, just what is on HEAD. Commits are gated before they make it to HEAD. Third party library versions are generally limited to one of two “approved” versions.
While both approaches address the problems of managing dependencies at scale, they also impose certain challenges. The share little approach favors decoupling and engineering velocity, while sacrificing code reuse and consistency. The monorepo approach favors consistency and risk reduction, while sacrificing freedom by requiring gates to deploying changes. Adopting either approach would entail significant changes to our development infrastructure and runtime architecture. Additionally, both solutions would challenge our culture of Freedom and Responsibility.
The challenge we’ve posed to ourselves is this:
Can we provide engineers at Netflix the benefits of a monorepo while still maintaining the flexibility of distributed repositories?
Using the monorepo as our requirements specification, we began exploring alternative approaches to achieving the same benefits. What are the core problems that a monorepo approach strives to solve? Can we develop a solution that works within the confines of a traditional binary integration world, where code is shared?
Our approach, while still experimental, can be distilled into three key features:
  • Publisher feedback - provide the owner of shared code fast feedback as to which of their consumers they just broke, both direct and transitive. Also, allow teams to block releases based on downstream breakages. Currently, our engineering culture puts sole responsibility on consumers to resolve these issues. By giving library owners feedback on the impact they have on the rest of Netflix, we expect them to take on additional responsibility (a rough sketch of this idea follows this list).
  • Managed source - provide consumers with a means to safely increment library versions automatically as new versions are released. Since we are already testing each new library release against all downstreams, why not bump consumer versions and accelerate version adoption, safely.
  • Distributed refactoring - provide owners of shared code a means to quickly find and globally refactor consumers of their API. We have started by issuing pull requests en masse to all Git repositories containing a consumer of a particular Java API. We’ve run some early experiments and expect to invest more in this area going forward.
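As a toy sketch of the publisher feedback idea only (the reverse dependency graph, repository layout, build command and Gradle property below are hypothetical, not our actual tooling), the flow could look like this:

  # On a new library release, test its direct and transitive consumers against the
  # candidate version and report which ones break.
  import subprocess

  def consumers_of(library, reverse_deps):
      """Breadth-first walk of a reverse dependency graph {lib: [consumers, ...]}."""
      seen, queue = set(), list(reverse_deps.get(library, []))
      while queue:
          consumer = queue.pop(0)
          if consumer in seen:
              continue
          seen.add(consumer)
          queue.extend(reverse_deps.get(consumer, []))
      return seen

  def publisher_feedback(library, candidate_version, reverse_deps):
      broken = []
      for consumer in consumers_of(library, reverse_deps):
          # Build the consumer with the candidate version forced onto its classpath
          # (the -PforcedVersions property is a hypothetical hook).
          result = subprocess.run(
              ["./gradlew", "test", f"-PforcedVersions={library}:{candidate_version}"],
              cwd=f"repos/{consumer}",
          )
          if result.returncode != 0:
              broken.append(consumer)
      return broken   # the publisher can then decide whether to block the release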
We are just starting our journey. Our publisher feedback service is currently being alpha tested by a number of service teams and we plan to broaden adoption soon, with managed source not far behind. Our initial experiments with distributed refactoring have helped us understand how best to rapidly change code globally. We also see an opportunity to reduce the size of the overall dependency graph by leveraging tools we build in this space. We believe that expanding and cultivating this capability will allow teams at Netflix to achieve true organization-wide continuous integration and reduce, if not eliminate, the pain of managing dependencies.
If this challenge is of interest to you, we are actively hiring for this team. You can apply using one of the links below:

- Mike McGarr, Dianne Marsh and the Developer Productivity team

Psyberg: Automated end to end catch up


By Abhinaya Shetty, Bharath Mummadisetty

This blog post will cover how Psyberg helps automate the end-to-end catchup of different pipelines, including dimension tables.

In the previous installments of this series, we introduced Psyberg and delved into its core operational modes: Stateless and Stateful Data Processing. Now, let’s explore the state of our pipelines after incorporating Psyberg.

Pipelines After Psyberg

Let’s explore how different modes of Psyberg could help with a multistep data pipeline. We’ll return to the sample customer lifecycle:

Processing Requirement:
Keep track of the end-of-hour state of accounts, e.g., Active/Upgraded/Downgraded/Canceled.

Solution:
One potential approach here would be as follows:

  1. Create two stateless fact tables:
    a. Signups
    b. Account Plans
  2. Create one stateful fact table:
    a. Cancels
  3. Create a stateful dimension that reads the above fact tables every hour and derives the latest account state.

Let’s look at how this can be integrated with Psyberg to auto-handle late-arriving data and corresponding end-to-end data catchup.

Navigating the Workflow: How Psyberg Handles Late-Arriving Data

We follow a generic workflow structure for both stateful and stateless processing with Psyberg; this helps maintain consistency and makes debugging and understanding these pipelines easier. The following is a concise overview of the various stages involved; for a more detailed exploration of the workflow specifics, please turn to the second installment of this series.

1. Psyberg Initialization

The workflow starts with the Psyberg initialization (init) step.

  • Input: List of source tables and required processing mode
  • Output: Psyberg identifies new events that have occurred since the last high watermark (HWM) and records them in the session metadata table.

The session metadata table can then be read to determine the pipeline input.
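As a rough sketch of what that read might look like in a PySpark job (the table name, columns, and session id below are illustrative, not Psyberg's actual schema):

  # Read the session metadata recorded by the Psyberg init step to drive the ETL input.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  session_id = "customer_lifecycle_2023_10_01_05"   # passed in by the workflow scheduler

  new_events = spark.sql(f"""
      SELECT source_table, processing_mode, min_hour, max_hour
      FROM   bdp.psyberg_session_metadata           -- hypothetical metadata table
      WHERE  session_id = '{session_id}'
  """).collect()

  # Each row describes which hours of which source changed since the last high
  # watermark, so the business logic only reads those partitions.
  for row in new_events:
      print(row.source_table, row.processing_mode, row.min_hour, row.max_hour)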

2. Write-Audit-Publish (WAP) Process

This is the general pattern we use in our ETL pipelines.

a. Write
Apply the ETL business logic to the input data identified in Step 1 and write to an unpublished Iceberg snapshot based on the Psyberg mode.

b. Audit
Run various quality checks on the staged data. Psyberg’s metadata session table is used to identify the partitions included in a batch run. Several audits, such as verifying source and target counts, are performed on this batch of data.

c. Publish
If the audits are successful, cherry-pick the staging snapshot to publish the data to production.
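As a rough sketch of this pattern on an Iceberg table with Spark (table names, the audit check, and the run id are illustrative; the staging and cherry-pick calls follow Apache Iceberg's write-audit-publish support):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  run_id = "account_state_2023_10_01_05"                    # unique id for this run
  transformed_df = spark.table("bdp.stg_account_state")     # stand-in for the ETL output

  # Write: with write.wap.enabled=true on the target table, writes made under a wap.id
  # create staged snapshots that are not yet the table's current state.
  spark.conf.set("spark.wap.id", run_id)
  transformed_df.writeTo("prod.db.account_state_hourly").overwritePartitions()

  # Audit: find the staged snapshot for this run and run quality checks against it.
  staged_id = spark.sql(f"""
      SELECT snapshot_id FROM prod.db.account_state_hourly.snapshots
      WHERE element_at(summary, 'wap.id') = '{run_id}'
      ORDER BY committed_at DESC LIMIT 1
  """).first().snapshot_id
  staged_count = (
      spark.read.option("snapshot-id", staged_id)
      .table("prod.db.account_state_hourly")
      .count()
  )
  assert staged_count > 0, "audit failed: staged snapshot is empty"

  # Publish: cherry-pick the audited snapshot so it becomes the table's current state.
  spark.sql(f"CALL prod.system.cherrypick_snapshot('db.account_state_hourly', {staged_id})")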

3. Psyberg Commit

Now that the data pipeline has been executed successfully, the new high watermark identified in the initialization step is committed to Psyberg’s high watermark metadata table. This ensures that the next instance of the workflow will pick up newer updates.

Callouts

  • Having the Psyberg step isolated from the core data pipeline allows us to maintain a consistent pattern that can be applied across stateless and stateful processing pipelines with varying requirements.
  • This also enables us to update the Psyberg layer without touching the workflows.
  • This is compatible with both Python and Scala Spark.
  • Debugging/figuring out what was loaded in every run is made easy with the help of workflow parameters and Psyberg Metadata.

The Setup: Automated end-to-end catchup

Let’s go back to our customer lifecycle example. Once we integrate all four components with Psyberg, here’s how we would set it up for automated catchup.

The three fact tables, comprising the signup and plan facts encapsulated in Psyberg’s stateless mode, along with the cancel fact in stateful mode, serve as inputs for the stateful sequential load ETL pipeline. This data pipeline monitors the various stages in the customer lifecycle.

In the sequential load ETL, we have the following features:

  • Catchup Threshold: This defines the lookback period for the data being read. For instance, only consider the last 12 hours of data.
  • Data Load Type: The ETL can either load the missed/new data specifically or reload the entire specified range.
  • Metadata Recording: Metadata is persisted for traceability.

Here is a walkthrough on how this system would automatically catch up in the event of late-arriving data:

Premise: All the tables were last loaded up to hour 5, meaning that any data from hour 6 onwards is considered new, and anything before that is classified as late data (as indicated in red above)

Fact level catchup:

  1. During the Psyberg initialization phase, the signup and plan facts identify the late data from hours 2 and 3, as well as the most recent data from hour 6. The ETL then appends this data to the corresponding partitions within the fact tables.
  2. The Psyberg initialization for the cancel fact identifies late data from hour 5 and additional data from hours 6 and 7. Since this ETL operates in stateful mode, the data in the target table from hours 5 to 7 will be overwritten with the new data.
  3. By focusing solely on updates and avoiding reprocessing of data based on a fixed lookback window, both Stateless and Stateful Data Processing maintain a minimal change footprint. This approach ensures data processing is both efficient and accurate.

Dimension level catchup:

  1. The Psyberg wrapper for this stateful ETL looks at the updates to the upstream Psyberg powered fact tables to determine the date-hour range to reprocess. Here’s how it would calculate the above range:
    MinHr = least(min processing hour from each source table)
    This ensures that we don’t miss out on any data, including late-arriving data. In this case, the minimum hour to process the data is hour 2.
    MaxHr = least(max processing hour from each source table)
    This ensures we do not process partial data, i.e., hours for which data has not been loaded into all source tables. In this case, the maximum hour to process the data is hour 6 (a sketch of this range calculation follows this list).
  2. The ETL process uses this time range to compute the state in the changed partitions and overwrite them in the target table. This helps overwrite data only when required and minimizes unnecessary reprocessing.
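As a small sketch of the range calculation above (the per-source hours would come from Psyberg's session metadata; the values below simply mirror the walkthrough):

  # Hours with changes per source, as identified by Psyberg in the fact-level catchup.
  source_hours = {
      "signup_fact": [2, 3, 6],    # late data for hours 2-3 plus new data for hour 6
      "plan_fact":   [2, 3, 6],
      "cancel_fact": [5, 6, 7],
  }

  # Don't miss late-arriving data: the earliest changed hour across all sources.
  min_hr = min(min(hours) for hours in source_hours.values())   # -> 2
  # Don't process partial data: only go as far as every source has been loaded.
  max_hr = min(max(hours) for hours in source_hours.values())   # -> 6

  print(f"Reprocess hours {min_hr} through {max_hr}")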

As seen above, by chaining these Psyberg workflows, we could automate the catchup for late-arriving data across hours 2 through 6. The Data Engineer does not need to perform any manual intervention in this case and can thus focus on more important things!

The Impact: How Psyberg Transformed Our Workflows

The introduction of Psyberg into our workflows has served as a valuable tool in enhancing accuracy and performance. The following are key areas that have seen improvements from using Psyberg:

  • Computational Resources Used:
    In certain instances, we’ve noticed a significant reduction in resource utilization, with the number of Spark cores used dropping by 90% following the implementation of Psyberg, compared to using fixed lookback windows
  • Workflow and Table Onboarding:
    We have onboarded 30 tables and 13 workflows into incremental processing since implementing Psyberg
  • Reliability and Accuracy:
    Since onboarding workflows to Psyberg, we have experienced zero manual catchups or missing data incidents
  • Bootstrap template:
    The process of integrating new tables into incremental processing has been made more accessible and now requires minimal effort using Psyberg

These performance metrics suggest that adopting Psyberg has been beneficial to the efficiency of our data processing workflows.

Next Steps and Conclusion

Integrating Psyberg into our operations has improved our data workflows and opened up exciting possibilities for the future. As we continue to innovate, Netflix’s data platform team is focused on creating a comprehensive solution for incremental processing use cases. This platform-level solution is intended to enhance our data processing capabilities across the organization. You can read more about this here!

In conclusion, Psyberg has proven to be a reliable and effective solution for our data processing needs. As we look to the future, we’re excited about the potential for further advancements in our data platform capabilities.


