
Creating Your Own EC2 Spot Market -- Part 2

In Part 1 of this series, Creating Your Own EC2 Spot Market, we explained how Netflix manages its EC2 footprint and how we take advantage of our daily peak of 12,000 unused instances, which we named the “internal spot market.”  This sizeable trough has significantly improved our encoding throughput, and we are pursuing other benefits from this large pool of unused resources.
The Encoding team went through two iterations of internal spot market implementations.  The initial approach was a simple schedule-based borrowing mechanism that was quickly deployed in June in the us-east AZ to reap immediate benefits.  We applied the experience we gained to influence the next iteration of the design based on real-time availability.  
The main challenge of using the spot instances effectively is handling the dynamic nature of our instance availability.  With correct timing, running spot instances is effectively free; when the timing is off, however, any EC2 usage is billed at the on-demand price.  In this post we will discuss how the real-time, availability-based internal spot market system works and how it efficiently uses this unused capacity.
Benefits of Extra Capacity
The encoding system at Netflix is responsible for encoding master media source files into many different output formats and bitrates for all Netflix supported devices.  A typical workload is triggered by source delivery, and sometimes the encoding system receives an entire season of a show within moments.  By leveraging the internal spot market, we have measured the equivalent of a 210% increase in encoding capacity.  With the extra boost of computing resources, we have improved our ability to handle a sudden influx of work and to quickly reduce our backlog.
In addition to the production environment, the encoding infrastructure maintains 40 “farms” for development and testing.  Each farm is a complete encoding system with 20+ micro-services that matches the capability and capacity of the production environment.  
Computing resources are continuously evaluated and redistributed based on workload.  With the boost of spot market instances, the total encoding throughput increases significantly.  On the R&D side, researchers leverage these extra resources to carry out experiments in a fraction of the time it used to take.  Our QA automation is able to broaden the coverage of our comprehensive suite of continuous integration and run these jobs in less time.
Spot Market Borrowing in Action
We started the new spot market system in October, and we are encouraged by the improved performance compared to our borrowing in the first iteration.
For instance, in one of the research projects, we triggered 12,000 video encoding jobs over a weekend.  We had anticipated the work to finish in a few days, but we were pleasantly surprised to discover that the jobs were completed in only 18 hours.
The following graph captures that weekend’s activity.
The Y-axis denotes the number of video encoder jobs queued in the messaging system; the red line represents high-priority jobs, and the yellow area shows medium- and low-priority jobs.
Important Considerations
  • By launching on-demand instances in the Encoding team AWS account, the Encoding team never impacts guaranteed capacity (reserved instances) from the main Netflix account.
  • The Encoding team competes for on-demand instances with other Netflix accounts.   
  • Spot instance availability fluctuates and can become unavailable at any moment.  The encoding service needs to react to these changes swiftly.
  • It is possible to dip into unplanned on-demand usage due to sudden surge of instance usage in other Netflix accounts while we have internal spot instances running.  The benefits of borrowing must significantly outweigh the cost of these on-demand charges.
  • Available spot capacity comes in different types and sizes.  We can make the most out of them by making our jobs instance type agnostic.
Design Goals
Cost Effectiveness: Use as many spot instances as are available.  Incur as little unplanned on-demand usage as possible.
Good Citizenship: We want to minimize contention that may cause a shortage in the on-demand pool.  We take a light-handed approach by yielding spot instances to other Netflix accounts when there is competition on resources.
Automation: The Encoding team invests heavily in automation.  The encoding system is responsible for encoding activities for the entire Netflix catalog 24x7, hands free.  Spot market borrowing needs to function continuously and autonomously.
Metrics: Collect Atlas metrics to measure effectiveness, pinpoint areas of inefficiency, and trend usage patterns in near real-time.
Key Borrowing Strategies
We spent a great deal of effort devising strategies to address the goals of Cost Effectiveness and Good Citizenship.  We started with a set of simple assumptions and then constantly iterated using our monitoring system, allowing us to validate and fine-tune the initial design into the following set of strategies (a simplified sketch follows the list):
Real-time Availability Based Borrowing: Closely align utilization with the fluctuating real-time spot instance availability using a Spinnaker API.  Spinnaker is a Continuous Delivery Platform that manages Netflix reservations and deployment pipelines.  It is in the optimal position to know what instances are in use across all Netflix accounts.
Negative Surplus Monitor: Sample spot market availability, and quickly terminate (yield) borrowed instances when we detect an overdraft of internal spot instances.  This ensures that our spot borrowing is treated as the lowest-priority usage in the company and reduces on-demand contention.
Idle Instance Detection: Detect over-allocated spot instances.  Accelerate down scaling of spot instances to improve time to release, with an additional benefit of reducing borrowing surface area.
Fair Distribution: When spot instances are abundant, distribute assignment evenly to avoid exhausting one EC2 instance type on a specific AZ.  This helps minimize on-demand shortage and contention while reducing involuntary churn due to negative surplus.
Smoothing Function: The resource scheduler evaluates assignments of EC2 instances based on a smoothed representation of workload, smoothing out jitters and spikes to prevent over-reaction.
Incremental Stepping & Fast Evaluation Cycle: Acting in incremental steps avoids over-reaction and allows us to evaluate the workload frequently for rapid self correction.  Incremental stepping also helps distribute instance usage across instance types and AZ more evenly.
Safety Margin: Reduce contention by leaving some amount of available spot instances unused.  It helps reduce involuntary termination due to minor fluctuations in usage in other Netflix accounts.
Curfew: Preemptively reduce spot usage ahead of predictable, rapid drops in spot availability (e.g. the nightly Netflix personalized recommendation computation schedule). These curfews help minimize preventable on-demand charges.
Evacuation Monitor: A system-wide toggle to immediately evacuate all borrowed usage in case of emergency (e.g. regional traffic failover), eliminating on-demand contention during the emergency.
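To make these strategies concrete, here is a highly simplified Python sketch of a borrowing control loop.  The function names, constants, and structure are hypothetical illustrations of the ideas above, not the actual Encoding resource manager.

SAFETY_MARGIN = 50   # surplus instances to leave untouched (Safety Margin)
STEP = 100           # maximum change per evaluation cycle (Incremental Stepping)
ALPHA = 0.3          # smoothing factor (Smoothing Function)

def smooth(previous, observed, alpha=ALPHA):
    """Smoothed view of the workload to avoid over-reacting to spikes."""
    return alpha * observed + (1 - alpha) * previous

def next_borrow_target(borrowed, available, demand, in_curfew=False, evacuating=False):
    """Decide how many spot instances to borrow on the next evaluation cycle.

    borrowed  -- instances we currently borrow
    available -- unused reserved instances company-wide, including what we borrow
    demand    -- smoothed number of instances the workload wants
    """
    if evacuating:                 # Evacuation Monitor: emergency, release everything
        return 0
    if available < borrowed:       # Negative Surplus Monitor: yield the overdraft now
        return max(0, available)
    ceiling = max(0, available - SAFETY_MARGIN)
    if in_curfew:                  # Curfew: back off ahead of predictable drops
        ceiling = min(ceiling, borrowed // 2)
    target = min(int(demand), ceiling)
    delta = max(-STEP, min(STEP, target - borrowed))   # incremental stepping
    return borrowed + delta

# Example: with 800 borrowed, 1,200 available, and demand for 2,000 instances,
# the next cycle steps up to 900 rather than jumping straight to the ceiling.
print(next_borrow_target(borrowed=800, available=1200, demand=2000))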
Observations
The following graph depicts a five-day span of spot usage by instance type.
This graph illustrates a few interesting points:
  • The variance in color represents different instance types in use, and in most cases the relatively even distribution of bands of color shows that instance type usage is reasonably balanced.
  • The sharp rise and drop of the peaks confirms that the encoding resource manager scales up and down relatively quickly in response to changes in workload.
  • The flat valleys show the frugality of instance usage. Spot instances are only used when there is work for them to do.
  • Not all color bands have the same height because the size of the reservation varies between instance types.  However, we are able to borrow from both large (orange) and small (green) pools, collectively satisfying the entire workload.
  • Finally, although this graph reports instance usage, it indirectly tracks the workload.  The overall shape of the graph shows that there is no discernible pattern in the workload, reflecting the event-driven nature of the encoding activities.
Efficiency
Based on the AWS billing data from October, we summed up all the borrowing hours and adjusted them relative to the r3.4xlarge instance type that makes up the Encoding reserved capacity.  With the addition of spot market instances, the effective encoding capacity increased by 210%.
Dark blue denotes spot market borrowing, and light blue represents on-demand usage.
On-demand pricing is multiple times more expensive than reserved instances, and it varies depending on instance type.  We took the October spot market usage and calculated what it would have cost with purely on-demand pricing and computed a 92% cost efficiency.
Lessons Learned
On-demand is Expensive: We already knew this fact, but the lesson truly sank in once we observed on-demand charges resulting from sudden overdrafts of spot usage.  Several of the strategies listed above (e.g. Safety Margin, Curfew) were devised specifically to mitigate this occurrence.
Versatility: Video encoding represents 70% of our computing needs.  We made some tweaks to the video encoder to run on a much wider selection of instance types.  As a result, we were able to leverage a vast number of spot market instances during different parts of the day.
Tolerance to Interruption: The encoding system is built to withstand interruptions. This attribute works well with the internal spot market since instances can be terminated at any time.
Next Steps
Although the current spot market borrowing system is a notable improvement over the previous attempt, we are uncovering the tip of the iceberg.  In the future, we want to leverage spot market instances from different EC2 regions as they become available.  We are also heavily investing in the next generation of encoding architecture that scales more efficiently and responsively.  Here are some ideas we are exploring:
Cross Region Utilization: By borrowing from multiple EC2 regions, we triple the unused reservations we can access compared to the current usable pool.  Using multiple regions also significantly reduces the concentration of on-demand usage in a single EC2 region.
Containerization: The current encoding system is based on ASG scaling.  We are actively investing in the next generation of our encoding infrastructure using container technology.  The container model will reduce overhead in ASG scaling, minimize overhead of churning, and increase performance and throughput as Netflix continues to grow its catalog.
Resource Broker: The current borrowing system is monopolistic in that it assumes the Encoding service is the sole borrower.  It is relatively easy to implement for one borrower.  We need to create a resource broker to better coordinate access to the spot surplus when sharing amongst multiple borrowers.
Conclusion
In the first month of deployment, we observed significant benefits in terms of performance and throughput.  We were successful in making use of Netflix’s idle capacity for production, research, and QA.  Our encoding velocity increased dramatically.  Experimental research turnaround time was drastically reduced.  A comprehensive full regression test finishes in half the time it used to take.  With a cost efficiency of 92%, the spot market is not completely free, but it is well worth the cost.
All of these benefits translate to faster research turnaround, improved playback quality, and ultimately a better member experience.



-- Media Cloud Engineering


Linux Performance Analysis in 60,000 Milliseconds

You log in to a Linux server with a performance issue: what do you check in the first minute?

At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools to monitor and investigate its performance. These include Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis. While those tools help us solve most issues, we sometimes need to login to an instance and run some standard Linux performance tools.

In this post, the Netflix Performance Engineering team will show you the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available.

First 60 Seconds: Summary

In 60 seconds you can get a high level idea of system resource usage and running processes by running the following ten commands. Look for errors and saturation metrics, as they are both easy to interpret, and then resource utilization. Saturation is where a resource has more load than it can handle, and can be exposed either as the length of a request queue, or time spent waiting.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
Some of these commands require the sysstat package to be installed. The metrics these commands expose will help you complete some of the USE Method: a methodology for locating performance bottlenecks. This involves checking utilization, saturation, and error metrics for all resources (CPUs, memory, disks, etc.). Also pay attention to when you have checked and exonerated a resource, as by process of elimination this narrows the targets to study, and directs any follow-on investigation.

The following sections summarize these commands, with examples from a production system. For more information about these tools, see their man pages.

1. uptime


$ uptime
23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02
This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux systems, these numbers include processes wanting to run on CPU, as well as processes blocked in uninterruptible I/O (usually disk I/O). This gives a high level idea of resource load (or demand), but can’t be properly understood without other tools. Worth a quick look only.

The three numbers are exponentially damped moving sum averages with a 1 minute, 5 minute, and 15 minute constant. The three numbers give us some idea of how load is changing over time. For example, if you’ve been asked to check a problem server, and the 1 minute value is much lower than the 15 minute value, then you might have logged in too late and missed the issue.
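As a rough illustration of how these exponentially damped averages behave (a simplified approximation, not the kernel's fixed-point implementation), assume the count of runnable plus uninterruptible tasks is sampled about every 5 seconds and folded into each average with an exponential decay constant:

import math

def update_load_avg(prev_avg, active_tasks, period_secs, interval_secs=5.0):
    """Fold one sample of runnable/uninterruptible tasks into a load average."""
    decay = math.exp(-interval_secs / period_secs)
    return prev_avg * decay + active_tasks * (1.0 - decay)

# A 1-minute average that was sitting at 19 climbs toward 30 busy tasks,
# but still lags the instantaneous load after a full minute of samples.
avg = 19.0
for _ in range(12):               # 12 samples of 5 seconds = one minute
    avg = update_load_avg(avg, 30, period_secs=60.0)
print(round(avg, 2))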

In the example above, the load averages show a recent increase, hitting 30 for the 1 minute value, compared to 19 for the 15 minute value. That the numbers are this large means a lot of something: probably CPU demand; vmstat or mpstat will confirm, which are commands 3 and 4 in this sequence.

2. dmesg | tail


$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.
This views the last 10 system messages, if there are any. Look for errors that can cause performance issues. The example above includes the oom-killer, and TCP dropping a request.

Don’t miss this step! dmesg is always worth checking.

3. vmstat 1


$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0
32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0
32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 99 1 0 0 0
32 0 0 200889568 73712 591856 0 0 0 48 11900 2459 99 0 0 0 0
32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0
^C
Short for virtual memory stat, vmstat(8) is a commonly available tool (first created for BSD decades ago). It prints a summary of key server statistics on each line.

vmstat was run with an argument of 1, to print one second summaries. The first line of output (in this version of vmstat) has some columns that show the average since boot, instead of the previous second. For now, skip the first line, unless you want to learn and remember which column is which.

Columns to check:
  • r: Number of processes running on CPU and waiting for a turn. This provides a better signal than load averages for determining CPU saturation, as it does not include I/O. To interpret: an “r” value greater than the CPU count is saturation.
  • free: Free memory in kilobytes. If there are too many digits to count, you have enough free memory. The “free -m” command, included as command 7, better explains the state of free memory.
  • si, so: Swap-ins and swap-outs. If these are non-zero, you’re out of memory.
  • us, sy, id, wa, st: These are breakdowns of CPU time, on average across all CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, or with Xen, the guest's own isolated driver domain).
The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time. A constant degree of wait I/O points to a disk bottleneck; this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why they are idle.

System time is necessary for I/O processing. A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently.

In the above example, CPU time is almost entirely in user-level, pointing to application level usage instead. The CPUs are also well over 90% utilized on average. This isn’t necessarily a problem; check for the degree of saturation using the “r” column.

4. mpstat -P ALL 1


$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
07:38:50 PM all 98.47 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 0.78
07:38:50 PM 0 96.04 0.00 2.97 0.00 0.00 0.00 0.00 0.00 0.00 0.99
07:38:50 PM 1 97.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00
07:38:50 PM 2 98.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
07:38:50 PM 3 96.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.03
[...]
This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence of a single-threaded application.

5. pidstat 1


$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

07:41:02 PM UID PID %usr %system %guest %CPU CPU Command
07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0
07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 15 mesos-slave
07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java
07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 27 java
07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 28 java
07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat

07:41:03 PM UID PID %usr %system %guest %CPU CPU Command
07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 15 mesos-slave
07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 27 java
07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 28 java
07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass
07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat
^C
Pidstat is a little like top’s per-process summary, but prints a rolling summary instead of clearing the screen. This can be useful for watching patterns over time, and also recording what you saw (copy-n-paste) into a record of your investigation.

The above example identifies two java processes as responsible for consuming CPU. The %CPU column is the total across all CPUs; 1591% shows that that java process is consuming almost 16 CPUs.

6. iostat -xz 1


$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
73.96 0.00 3.73 0.03 0.06 22.21

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.23 0.21 0.18 4.52 2.08 34.37 0.00 9.98 13.80 5.42 2.44 0.09
xvdb 0.01 0.00 1.02 8.94 127.97 598.53 145.79 0.00 0.43 1.78 0.28 0.25 0.25
xvdc 0.01 0.00 1.02 8.86 127.79 595.94 146.50 0.00 0.45 1.82 0.30 0.27 0.26
dm-0 0.00 0.00 0.69 2.32 10.47 31.69 28.01 0.01 3.23 0.71 3.98 0.13 0.04
dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00
dm-2 0.00 0.00 0.09 0.07 1.35 0.36 22.50 0.00 2.55 0.23 5.62 1.78 0.03
[...]
^C
This is a great tool for understanding block devices (disks), both the workload applied and the resulting performance. Look for:
  • r/s, w/s, rkB/s, wkB/s: These are the delivered reads, writes, read Kbytes, and write Kbytes per second to the device. Use these for workload characterization. A performance problem may simply be due to an excessive load applied.
  • await: The average time for the I/O in milliseconds. This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or device problems.
  • avgqu-sz: The average number of requests issued to the device. Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices which front multiple back-end disks.)
  • %util: Device utilization. This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device. Values close to 100% usually indicate saturation.
If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from saturated, and may be able to handle much more work.

Bear in mind that poorly performing disk I/O isn’t necessarily an application issue. Many techniques are typically used to perform I/O asynchronously, so that the application doesn’t block and suffer the latency directly (e.g., read-ahead for reads, and buffering for writes).

7. free -m


$ free -m
total used free shared buffers cached
Mem: 245998 24545 221453 83 59 541
-/+ buffers/cache: 23944 222053
Swap: 0 0 0
The right two columns show:
  • buffers: For the buffer cache, used for block device I/O.
  • cached: For the page cache, used by file systems.
We just want to check that these aren’t near-zero in size, which can lead to higher disk I/O (confirm using iostat), and worse performance. The above example looks fine, with many Mbytes in each.

The “-/+ buffers/cache” provides less confusing values for used and free memory. Linux uses free memory for the caches, but can reclaim it quickly if applications need it. So in a way the cached memory should be included in the free memory column, which this line does. There’s even a website, linuxatemyram, about this confusion.
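As a quick check of that arithmetic against the example output above (values in megabytes; minor rounding differences are expected), the “-/+ buffers/cache” numbers can be reproduced directly:

used, free_mem = 24545, 221453
buffers, cached = 59, 541

print(used - buffers - cached)      # ~23944: memory actually in use by applications
print(free_mem + buffers + cached)  # ~222053: memory available once caches are reclaimed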

It can be additionally confusing if ZFS on Linux is used, as we do for some services, as ZFS has its own file system cache that isn’t reflected properly by the free -m columns. It can appear that the system is low on free memory, when that memory is in fact available for use from the ZFS cache as needed.

8. sar -n DEV 1


$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00
12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00
12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00
12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00
12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
^C
Use this tool to check network interface throughput: rxkB/s and txkB/s, as a measure of workload, and also to check if any limit has been reached. In the above example, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/sec (well under, say, a 1 Gbit/sec limit).

This version also has %ifutil for device utilization (max of both directions for full duplex), which is something we also use Brendan’s nicstat tool to measure. And like with nicstat, this is hard to get right, and seems to not be working in this example (0.00).

9. sar -n TCP,ETCP 1


$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

12:17:19 AM active/s passive/s iseg/s oseg/s
12:17:20 AM 1.00 0.00 10233.00 18846.00

12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:20 AM 0.00 0.00 0.00 0.00 0.00

12:17:20 AM active/s passive/s iseg/s oseg/s
12:17:21 AM 1.00 0.00 8359.00 6039.00

12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:21 AM 0.00 0.00 0.00 0.00 0.00
^C
This is a summarized view of some key TCP metrics. These include:
  • active/s: Number of locally-initiated TCP connections per second (e.g., via connect()).
  • passive/s: Number of remotely-initiated TCP connections per second (e.g., via accept()).
  • retrans/s: Number of TCP retransmits per second.
The active and passive counts are often useful as a rough measure of server load: number of new accepted connections (passive), and number of downstream connections (active). It might help to think of active as outbound, and passive as inbound, but this isn’t strictly true (e.g., consider a localhost to localhost connection).

Retransmits are a sign of a network or server issue; it may be an unreliable network (e.g., the public Internet), or it may be due to a server being overloaded and dropping packets. The example above shows just one new TCP connection per second.

10. top


$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java
4213 root 20 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave
66128 titancl+ 20 0 24344 2332 1172 R 1.0 0.0 0:00.07 top
5235 root 20 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java
4299 root 20 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java
1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0
8 root 20 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched
The top command includes many of the metrics we checked earlier. It can be handy to run it to see if anything looks wildly different from the earlier commands, which would indicate that load is variable.

A downside to top is that it is harder to see patterns over time, which may be clearer in tools like vmstat and pidstat, which provide rolling output. Evidence of intermittent issues can also be lost if you don’t pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue), and the screen clears.

Follow-on Analysis


There are many more commands and methodologies you can apply to drill deeper. See Brendan’s Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands, covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing.

Tackling system reliability and performance problems at web scale is one of our passions. If you would like to join us in tackling these kinds of challenges, we are hiring!

Caching Content for Holiday Streaming

'Tis the season for holiday binging. How do seasonal viewing patterns affect how Netflix stores and streams content?

Our solution for delivering streaming content is Open Connect, a custom-built system that distributes and stores the audio and video content our members download when they stream a Netflix title (a movie or episode). Netflix has a unique library of large files with relatively predictable popularity, and Open Connect's global, distributed network of caching servers was designed with these attributes in mind. This system localizes content as close to our members as possible to achieve a high-quality playback experience through low latency access to content over optimal internet paths. A subset of highly-watched titles makes up a significant share of total streaming, and caching, the process of storing content based on how often it’s streamed by our members, is critical to ensuring enough copies of a popular title are available in a particular location to support the demand of all the members who want to stream it.

We curate rich, detailed data on what our members are watching, giving us a clear signal of which content is popular today. We enrich that signal with algorithms to produce a strong indicator of what will be popular tomorrow. As a title increases in popularity, more copies of it are added to our caching servers, replacing other, less popular content on a nightly cadence when the network is least busy. Deleting and adding files from servers comes with overhead, however, and we perform these swaps of content with the help of algorithms designed to balance cache efficiency with the network cost of replacing content.

How does this play out for titles with highly seasonal patterns? Metadata assembled by human taggers and reviewed by our internal enhanced content team tells us which titles are holiday-related, and using troves of streaming data we can track their popularity throughout the year. Holiday titles ramp in popularity starting in November, so more copies of these titles are distributed across the network from November through their popularity peak at the end of December. The cycle comes full circle when the holiday content is displaced by relatively more popular titles in January.

[Figure: Weekly holiday streaming]

Holiday viewing follows a predictable annual pattern, but we also have to deal with less predictable scenarios like introducing new shows without any viewing history or external events that suddenly drive up the popularity of certain titles. For new titles, we model a combination of external and internal data points to create a predicted popularity, allowing us to appropriately cache that content before the first member ever streams it. For unexpected spikes in popularity driven by events like actors popping up in the news, we are designing mechanisms to let us quickly push content to our caches outside of the nightly replacement schedule as they are actively serving members. We're also exploring ways to evaluate popularity more locally; what's popular in Portland may not be what's popular in Philadelphia.

Whether your tastes run toward Love Actually or The Nightmare Before Christmas, your viewing this holiday season provides valuable information to help optimize how Netflix stores its content.

Love data, measurement and Netflix? Join us!

Debugging Node.js in Production

By Kim Trott, Yunong Xiao

We recently hosted our latest JavaScript Talks event on our new campus at Netflix headquarters in Los Gatos, California. Yunong Xiao, senior software engineer on our Node.js platform, presented on debugging Node.js in production. Yunong showed hands-on techniques using the scientific method to root cause and solve for runtime performance issues, crashes, errors, and memory leaks.





We’ve shared some useful links from our talk.
Videos from our past talks can always be found on our Netflix UI Engineering channel on YouTube. If you’re interested in being notified of future events, just sign up on our notification list.

High Quality Video Encoding at Scale

At Netflix we receive high quality sources for our movies and TV shows and encode them to the best video streams possible for a given member’s viewing device and bandwidth capabilities. With the continued growth of our service it has been essential to build a video encoding pipeline that is highly robust, efficient and scalable. Our production system is designed to easily scale to support the demands of the business (i.e., more titles, more video encodes, shorter time to deploy), while guaranteeing a high quality of experience for our members.
Pipeline in the Cloud
The video encoding pipeline runs on EC2 Linux cloud instances. The elasticity of the cloud enables us to seamlessly scale up when more titles need to be processed, and scale down to free up resources. Our video processing applications don’t require any special hardware and can run on a number of EC2 instance types. Long processing jobs are divided into smaller tasks and parallelized to reduce end-to-end delay and local storage requirements. This also allows us to exploit our internal spot market where instances are dynamically allocated based on real-time availability of the compute resources. If a task does not complete because an instance is abruptly terminated, only a small amount of work is lost and the task is rescheduled for another instance. The ability to recover from these transient errors is essential for a robust cloud-based system.

The figure below shows a high-level overview of our system. We ingest high quality video sources and generate video encodes of various codec profiles, at multiple quality representations per profile. The encodes are packaged and then deployed to a content delivery network for streaming. During a streaming session, the client requests the encodes it can play and adaptively switches among quality levels based on network conditions.

[Figure: High-level overview of the video encoding pipeline]
Video Source Inspection
To ensure that we have high quality output streams, we need pristine video sources. Netflix ingests source videos from our originals production houses or content partners. In some undesirable cases, the delivered source video contains distortion or artifacts which would result in bad quality video encodes – garbage in means garbage out. These artifacts may have been introduced by multiple processing and transcoding steps before delivery, data corruption during transmission or storage, or human errors during content production. Rather than fixing the source video issues after ingest (for example, apply error concealment to corrupted frames or re-edit sources which contain extra content), Netflix rejects the problematic source video and requests redelivery. Rejecting problematic sources ensures that:
  • The best source video available is ingested into the system. In many cases, error mitigation techniques only partially fix the problem.
  • Complex algorithms (which could have been avoided by better processes upstream) do not unnecessarily burden the Netflix ingest pipeline.
  • Source issues are detected early where a specific and actionable error can be raised.
  • Content partners are motivated to triage their production pipeline and address the root causes of the problems. This will lead to improved video source deliveries in the future.
Our preferred source type is Interoperable Master Format (IMF).  In addition, we support ProRes, DPX, and MPEG (typically older sources).  During source inspection, we 1) verify that the source conforms to the relevant specification(s), 2) detect content that could lead to a bad viewing experience, and 3) generate metadata required by the encoding pipeline. If the inspection deems the source unacceptable, the system automatically informs our content partner about the issues and requests a redelivery of the source.

A modern 4K source file can be quite large. Larger, in fact, than a typical drive on an EC2 instance. In order to efficiently support these large source files, we must run the inspection on the file in smaller chunks. This chunked model lends itself to parallelization. As shown in the more detailed diagram below, an initial inspection step is performed to index the source file, i.e. determine the byte offsets for frame-accurate seeking, and generate basic metadata such as resolution and frame count. The file segments are then processed in parallel on different instances. For each chunk, bitstream-level and pixel-level analysis is applied to detect errors and generate metadata such as temporal and spatial fingerprints. After all the chunks are inspected, the results are assembled by the inspection aggregator to determine whether the source should be allowed into the encoding pipeline.  With our highly optimized inspection workflow, we can inspect a 4K source in less than 15 minutes.  Note that longer duration sources would have more chunks, so the total inspection time will still be less than 15 minutes.

[Figure: Chunked source inspection workflow]
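As an illustration of this chunked model (the function names here are hypothetical stand-ins, not the actual Netflix services), the inspection stage amounts to a map step over chunks followed by an aggregation step:

from concurrent.futures import ProcessPoolExecutor

def inspect_source(source_path, index_source, inspect_chunk, aggregate):
    """Sketch of chunked, parallel source inspection.

    index_source(path)            -> (basic metadata, list of chunk descriptors with
                                      byte offsets for frame-accurate seeking)
    inspect_chunk(chunk)          -> per-chunk errors plus metadata such as temporal
                                      and spatial fingerprints
    aggregate(metadata, results)  -> overall accept/reject decision and assembled metadata
    """
    metadata, chunks = index_source(source_path)
    with ProcessPoolExecutor() as pool:
        # In production the chunks are spread across many instances; a local
        # process pool is enough to show the shape of the workflow.
        results = list(pool.map(inspect_chunk, chunks))
    return aggregate(metadata, results)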
Parallel Video Encoding
At Netflix we stream to a heterogeneous set of viewing devices. This requires a number of codec profiles: VC1, H.264/AVC Baseline, H.264/AVC Main and HEVC. We also support varying bandwidth scenarios for our members, all the way from sub-0.5 Mbps cellular to 100+ Mbps high-speed Internet. To deliver the best experience, we generate multiple quality representations at different bitrates (ranging from 100 kbps to 16 Mbps) and the Netflix client adaptively selects the optimal stream given the instantaneous bandwidth.


Similar to inspection, encoding is performed on chunks of the source file, which allows for efficient parallelization. Since we strive for quality control at every step of the pipeline, we verify the correctness of each encoded chunk right after it completes encoding. If a problem is detected, we can immediately triage the problem (or in the case of transient errors, resubmit the task) without waiting for the entire video to complete. When all the chunks corresponding to a stream have successfully completed, they are stitched together by a video assembler. To guard against frame accuracy issues that may have been introduced by incorrect parallel encoding (for example, chunks assembled in the wrong order, or frames dropped or duplicated at chunk boundaries), we validate the assembled stream by comparing the spatial and temporal fingerprints of the encode with that of the source video (fingerprints of the source are generated during the inspection stage).
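A similarly simplified sketch of the encode-verify-assemble flow described above, again with illustrative function names rather than the production components:

def encode_title(chunks, encode_chunk, verify_chunk, assemble, fingerprint, source_fingerprints):
    """Encode chunks, verify each one immediately, stitch, then validate the result."""
    encoded = []
    for chunk in chunks:                      # in production these encodes run in parallel
        output = encode_chunk(chunk)
        if not verify_chunk(output):          # fail fast: triage or resubmit just this chunk
            raise RuntimeError(f"chunk {chunk} failed verification")
        encoded.append(output)
    stream = assemble(encoded)                # the video assembler stitches chunks in order
    # Guard against frame-accuracy issues by comparing spatial and temporal
    # fingerprints of the assembled encode against those of the source.
    if fingerprint(stream) != source_fingerprints:
        raise RuntimeError("assembled stream does not match source fingerprints")
    return stream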

In addition to straightforward encoding, the system calculates multiple full-reference video quality metrics for each output video stream. By automatically generating quality scores for each encode, we can monitor video quality at scale. The metrics also help pinpoint bugs in the system and guide us in finding areas for improving our encode recipes. We will provide more detail on the quality metrics we utilize in our pipeline in a future blog post.
Quality of Service
Before we implemented parallel chunked encoding, a 1080p movie could take days to encode, and a failure occurring late in the process would delay the encode even further. With our current pipeline, a title can be fully inspected and encoded at the different profiles and quality representations, with automatic quality control checks, within a few hours. This enables us to stream titles within just a few hours of their original broadcast. We are currently working on further improvements to our system which will allow us to inspect and encode a 1080p source in 30 minutes or less.  Note that since the work is done in parallel, processing time is not increased for longer sources.

Before automated quality checks were integrated into our system, encoding issues (picture corruption, inserted black frames, frame rate conversion, interlacing artifacts, frozen frames, etc.) could go unnoticed until reported by Netflix members through Customer Support. Not only was this a poor member experience, but triaging these issues was also costly and inefficient, often escalating through many teams before the root cause was found.  In addition, encoding failures (for example due to corrupt sources) would also require manual intervention and long delays in root-causing the failure.  With our investment in automated inspection at scale, we detect these issues early, whether the cause is a bad source delivery, an implementation bug, or a glitch in one of the cloud instances, and we provide specific and actionable error messages. For a source that passes our inspections, we have an encode reliability of 99.99% or better.  When we do find a problem that was not caught by our algorithms, we design new inspections to detect those issues in the future.
In Summary
High quality video streams are essential for delivering a great Netflix experience to our members. We have developed, and continue to improve on, a video ingest and encode pipeline that runs on the cloud reliably and at scale. We designed for automated quality control checks throughout so that we fail fast and detect issues early in the processing chain. Video is processed in parallel segments. This decreases end-to-end processing delay, reduces the required local storage and improves the system’s error resilience. We have invested in integrating video quality metrics into the pipeline so that we can continuously monitor performance and further optimize our encoding.

Our encoding pipeline, combined with the compute power of the Netflix internal spot market, has value outside our day-to-day production operations. We leverage this system to run large-scale video experiments (codec comparisons, encode recipe optimizations, quality metrics design, etc.) which strive to answer questions that are important to delivering the highest quality video streams, and at the same time could benefit the larger video research community.

by Anne Aaron and David Ronca

Optimizing Content Quality Control at Netflix with Predictive Modeling

By Nirmal Govind and Athula Balachandran

Over 69 million Netflix members stream billions of hours of movies and shows every month in North and South America, parts of Europe and Asia, Australia and New Zealand. Soon, Netflix will be available in every corner of the world with an even more global member base.

As we expand globally, our goal is to ensure that every member has a high-quality experience every time they stream content on Netflix. This challenging problem is impacted by factors that include quality of the member's Internet connection, device characteristics, content delivery network, algorithms on the device, and quality of content.

We previously looked at opportunities to improve the Netflix streaming experience using data science. In this post, we'll focus on predictive modeling to optimize the quality control (QC) process for content at Netflix.

Content Quality

An important aspect of the streaming experience is the quality of the video, audio, and text (subtitle, closed captions) assets that are used.

Imagine sitting down to watch the first episode of a new season of your favorite show, only to find that the video and audio are off by 20 seconds. You decide to watch it anyway and turn on subtitles to follow along. What if the subtitles are poorly positioned and run off the screen?

Depending on the severity of the issue, you may stop watching, or continue because you’re already invested in the content. Either way, it leaves a bad impression and can negatively impact member satisfaction and retention. Netflix sets a high bar on content quality and has a QC process in place to ensure this bar is met. Let’s take a quick look at how the Netflix digital supply chain works and the role of the QC process.

We receive assets either from the content owners (e.g. studios, documentary filmmakers) or from a fulfillment house that obtains content from the owners and packages the assets for delivery to Netflix. Our QC process consists of automated and manual inspections to identify and replace assets that do not meet our specified quality standards.

Automated inspections are performed before and after the encoding process that compresses the larger “source” files into a set of smaller encoded distribution files (at different bitrates, for different devices, etc.). Manual QC is then done to check for issues easily detected with the human eye: depending on the content, a QCer either spot checks selected points of the movie or show, or watches the entire duration of the content. Examples of issues caught during the QC process include video interlacing artifacts, audio-video sync issues, and text issues such as missing or poorly placed subtitles.

It is worth noting that the fraction of assets that fail quality checks is small. However, to optimize the streaming experience, we’re focused on detecting and replacing those sub-par assets. This is even more important as Netflix expands globally and more members consume content in a variety of new languages (both dubbed audio and subtitles). Also, we may receive content from new partners who have not delivered to us before and are not familiar with our quality standards.

Predictive Quality Control

As the Netflix catalog, member base, and global reach grow, it is important to scale the manual QC process by identifying defective assets accurately and efficiently.

Looking at the data

Data and data science play a key role in how Netflix operates, so the natural question to ask was:
Can we use data science to help identify defective assets?

We looked at the data on manual QC failures and observed that certain factors affected the likelihood of an asset failing QC. For example, some combinations of content and fulfillment partners had a higher rate of defects for certain types of assets. Metadata related to the content also showed patterns of failure. For example, older content (by release year) had a higher defect rate, likely due to the use of older formats for the creation and storage of assets. The genre of the content also exhibited certain patterns of failure.

These types of factors were used to build a machine learning model that predicts the probability that a delivered asset would not meet the Netflix quality standards.

A predictive model to identify defective assets helps in two significant ways:

  • Scale the content QC process by reducing QC effort on assets that are not defective.
  • Improve member experience by re-allocating resources to the discovery of hard-to-find quality issues that may otherwise be missed due to spot checks.

Machine Learning

Using results from past manual QC checks, a supervised machine learning (ML) approach was used to train a predictive quality control model that predicts a “fail” (likely has content quality issue) or “pass.” If an asset is predicted to fail QC, it is sent to manual QC. The modified supply chain workflow with the predictive QC model is shown below.

Netflix Supply Chain with Predictive Quality Control

A key goal of the model is to identify all defective assets even if this results in extra manual checks. Hence, we tuned the model for low false-negative rate (i.e. fewer uncaught defects) at the cost of increased false-positive rate.

Given that only a small fraction of the delivered assets are defective, one of the main challenges is class imbalance in the training data, i.e. we have a lot more data on “pass” assets than “fail” assets. We tackled this by using cost-sensitive training that heavily penalizes misclassification of the minority class (i.e. defective assets).
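As a hedged illustration of cost-sensitive training paired with a threshold tuned for a low false-negative rate (scikit-learn and the synthetic features below are stand-ins; the actual Netflix model and features are not described here):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Placeholder features standing in for partner, content age, genre, asset type, etc.
X_train = rng.normal(size=(1000, 4))
y_train = (rng.random(1000) < 0.02).astype(int)   # ~2% "fail" labels: heavy class imbalance

# Weight the rare "fail" class heavily so its misclassification is penalized
# far more than misclassifying a "pass" asset (cost-sensitive training).
model = RandomForestClassifier(n_estimators=200, class_weight={0: 1, 1: 50}, random_state=0)
model.fit(X_train, y_train)

# Tune for a low false-negative rate: send an asset to manual QC whenever its
# predicted failure probability exceeds a deliberately low threshold.
X_new = rng.normal(size=(5, 4))
needs_manual_qc = model.predict_proba(X_new)[:, 1] > 0.1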

As with most model-building exercises, domain knowledge played an important role in this project. An observation that led to improved model performance was that defective assets are typically delivered in batches. For example, video assets from episodes within the same season of a show are mostly defective or mostly non-defective. It’s likely that assets in a batch were created or packaged around the same time and/or with the same equipment, and hence with similar defects.

We performed offline validation of the model by passively making predictions on incoming assets and comparing with actual results from manual QC. This allowed us to fine tune the model parameters and validate the model before deploying into production. Offline validation also confirmed the scaling and quality improvement benefits outlined earlier.

Looking Ahead

Predictive QC is a significant step forward in ensuring that members have an amazing viewing experience every time they watch a movie or show on Netflix. As the slate of Netflix Originals grows and more aspects of content creation—for example, localization, including subtitling and dubbing—are owned by Netflix, there is opportunity to further use data to improve content quality and the member experience.

We’re continuously innovating with data to build creative models and algorithms that improve the streaming experience for Netflix members. The scale of problems we encounter—Netflix accounts for 37.1% of North American downstream traffic at peak—provides for a set of unique modeling challenges. Also, we partner closely with the engineering teams to design and build production systems that embed such machine learning models. If you're interested in working in this exciting space, please check out the Streaming Science & Algorithms and Content Platform Engineering positions on the Netflix jobs site.

Per-Title Encode Optimization


We’ve spent years developing an approach, called per-title encoding, where we run analysis on an individual title to determine the optimal encoding recipe based on its complexity. Imagine having very involved action scenes that need more bits to encapsulate the information versus unchanging landscape scenes or animation that need less. This allows us to deliver the same or better experience while using less bandwidth, which will be particularly important in lower bandwidth countries and as we expand to places where video viewing often happens on mobile networks.

Background
In traditional terrestrial, cable or satellite TV, broadcasters have an allocated bandwidth and the program or set of programs are encoded such that the resulting video streams occupy the given fixed capacity. Statistical multiplexing is oftentimes employed by the broadcaster to efficiently distribute the bitrate among simultaneous programs. However, the total accumulated bitrate across the programs should still fit within the limited capacity. In many cases, padding is even added using null packets to guarantee strict constant bitrate for the fixed channel, thus wasting precious data rate. Furthermore, with pre-set channel allocations, less popular programs or genres may be allocated lower bitrates (and therefore, worse quality) than shows that are viewed by more people.

With the advantages of Internet streaming, Netflix is not bound to pre-allocated channel constraints. Instead, we can deliver the best video quality stream to a member, no matter what the program or genre, tailored to the member’s available bandwidth and viewing device capability. We pre-encode streams at various bitrates applying optimized encoding recipes. On the member’s device, the Netflix client runs adaptive streaming algorithms which instantaneously select the best encode to maximize video quality while avoiding playback interruptions due to rebuffers.

Encoding with the best recipe is not a simple problem. For example, assuming a 1 Mbps bandwidth, should we stream H.264/AVC at 480p, 720p or 1080p? At 480p, 1 Mbps will likely not exhibit encoding artifacts such as blocking or ringing, but if the member is watching on an HD device, the upsampled video will not be sharp. On the other hand, if we encode at 1080p we send a higher resolution video, but the bitrate may be too low, such that most scenes will contain annoying encoding artifacts.
The Best Recipe for All
When we first deployed our H.264/AVC encodes in late 2010, our video engineers developed encoding recipes that worked best across our video catalogue (at that time). They tested various codec configurations and performed side-by-side visual tests to settle on codec parameters that produced the best quality trade-offs across different types of content. A set of bitrate-resolution pairs (referred to as a bitrate ladder), listed below, were selected such that the bitrates were sufficient to encode the stream at that resolution without significant encoding artifacts:

Bitrate (kbps)    Resolution
235               320x240
375               384x288
560               512x384
750               512x384
1050              640x480
1750              720x480
2350              1280x720
3000              1280x720
4300              1920x1080
5800              1920x1080

This “one-size-fits-all” fixed bitrate ladder achieves, for most content, good quality encodes given the bitrate constraint.  However, for some cases, such as scenes with high camera noise or film grain noise, the highest 5800 kbps stream would still exhibit blockiness in the noisy areas. On the other end, for simple content like cartoons, 5800 kbps is far more than needed to produce excellent 1080p encodes. In addition, a customer whose network bandwidth is constrained to 1750 kbps might be able to watch the cartoon at HD resolution, instead of the SD resolution specified by the ladder above.
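To make the ladder concrete, here is a minimal sketch of how a client could pick a rung from it given an estimate of available bandwidth; the real adaptive streaming algorithms on Netflix devices are far more sophisticated than this.

# Fixed bitrate ladder from the table above: (bitrate in kbps, resolution).
LADDER = [
    (235, "320x240"), (375, "384x288"), (560, "512x384"), (750, "512x384"),
    (1050, "640x480"), (1750, "720x480"), (2350, "1280x720"), (3000, "1280x720"),
    (4300, "1920x1080"), (5800, "1920x1080"),
]

def pick_rung(available_kbps, ladder=LADDER):
    """Pick the highest-bitrate rung that fits under the available bandwidth."""
    candidates = [rung for rung in ladder if rung[0] <= available_kbps]
    return candidates[-1] if candidates else ladder[0]

print(pick_rung(1750))   # (1750, '720x480'): the SD cap discussed above
print(pick_rung(6000))   # (5800, '1920x1080')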

The titles in Netflix’s video collection have very high diversity in signal characteristics. In the graph below we present a depiction of the diversity of 100 randomly sampled titles.  We encoded 100 sources at 1080p resolution using x264 constant QP (Quantization Parameter) rate control. At each QP point, for every title, we calculate the resulting bitrate in kbps, shown on the x-axis, and PSNR (Peak Signal-To-Noise Ratio) in dB, shown on the y-axis, as a measure of video quality.


The plots show that some titles reach very high PSNR (45 dB or more) at bitrates of 2500 kbps or less. On the other extreme, some titles require bitrates of 8000 kbps or more to achieve an acceptable PSNR of 38 dB.

Given this diversity, a one-size-fits-all scheme obviously cannot provide the best video quality for a given title and member’s allowable bandwidth. It can also waste storage and transmission bits because, in some cases, the allocated bitrate goes beyond what is necessary to achieve a perceptible improvement in video quality.

Side Note on Quality Metrics:  For the above figure, and many of the succeeding plots, we plot PSNR as the measure of quality. PSNR is the most commonly used metric in video compression. Although PSNR does not always reflect perceptual quality, it is a simple way to measure the fidelity to the source, gives a good indication of quality at the high and low ends of the range (i.e. 45 dB is very good quality, 35 dB will show encoding artifacts), and is a good indication of quality trends within a single title. The analysis can also be applied using other quality measures such as the VMAF perceptual metric. VMAF (Video Multi-Method Assessment Fusion) is a perceptual quality metric developed by Netflix in collaboration with University of Southern California researchers.  We will publish details of this quality metric in a future blog.
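For reference, the PSNR values in these plots follow the standard definition; below is a small sketch of the textbook computation for 8-bit frames, assuming NumPy arrays (this is not the exact tooling used in the pipeline).

import numpy as np

def psnr(reference, encoded, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between a source frame and its encode."""
    mse = np.mean((reference.astype(np.float64) - encoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")       # identical frames
    return 10.0 * np.log10((max_value ** 2) / mse)

# A mean squared error of about 2 corresponds to roughly 45 dB (visually transparent),
# while an MSE of about 20 corresponds to roughly 35 dB, where artifacts become visible.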
The Best Recipe for the Content
Why Per-Title?
Consider an animation title where the content is “simple”, that is, the video frames are composed mostly of flat regions with no camera or film grain noise and minimal motion between frames. We compare the quality curve for the fixed bitrate ladder with a bitrate ladder optimized for the specific title:


As shown in the figure above, encoding this video clip at 1920x1080, 2350 kbps (A) produces a high quality encode, and adding bits to reach 4300 kbps (B) or even 5800 kbps (C) will not deliver a noticeable improvement in visual quality (for encodes with PSNR 45 dB or above, the distortion is perceptually unnoticeable). In the fixed bitrate ladder, for 2350 kbps, we encode at 1280x720 resolution (D). Therefore members with bandwidth constraints around that point are limited to 720p video instead of the better quality 1080p video.

On the other hand, consider an action movie that has significantly more temporal motion and spatial texture than the animation title. It has scenes with fast-moving objects, quick scene changes, explosions and water splashes. The graph below shows the quality curve of an action movie.

Encoding these high complexity scenes at 1920x1080, 4300 kbps (A), would result in encoding artifacts such as blocking, ringing and contouring. A better quality trade-off would be to encode at a lower resolution 1280x720 (B), to eliminate the encoding artifacts at the expense of adding scaling. Encoding artifacts are typically more annoying and visible than blurring introduced by downscaling (before the encode) then upsampling at the member’s device. It is possible that for this title with high complexity scenes, it would even be beneficial to encode 1920x1080 at a bitrate beyond 5800 kbps, say 7500 kbps, to eliminate the encoding artifacts completely.

To deliver the best quality video to our members, each title should receive a unique bitrate ladder, tailored to its specific complexity characteristics. Over the last few years, the encoding team at Netflix invested significant research and engineering to investigate and answer the following questions:
  • Given a title, how many quality levels should be encoded such that each level produces a just-noticeable-difference (JND)?
  • Given a title, what is the best resolution-bitrate pair for each quality level?
  • Given a title, what is the highest bitrate required to achieve the best perceivable quality?
  • Given a video encode, what is the human perceived quality?
  • How do we design a production system that can answer the above questions in a robust and scalable way?
The Algorithm
To design the optimal per-title bitrate ladder, we select the total number of quality levels and the bitrate-resolution pair for each quality level according to several practical constraints.  For example, we need backward-compatibility (streams are playable on all previously certified Netflix devices), so we limit the resolution selection to a finite set -- 1920x1080, 1280x720, 720x480, 512x384, 384x288 and 320x240. In addition, the bitrate selection is also limited to a finite set, where the adjacent bitrates have an increment of roughly 5%.

We also have a number of optimality criteria that we consider.
  • The selected bitrate-resolution pair should be efficient, i.e. at a given bitrate, the produced encode should have as high quality as possible.
  • Adjacent bitrates should be perceptually spaced. Ideally, the perceptual difference between two adjacent bitrates should fall just below one JND. This ensures that the quality transitions can be smooth when switching between bitrates. It also ensures that the least number of quality levels are used, given a wide range of perceptual quality that the bitrate ladder has to span.

To build some intuition, consider the following example where we encode a source at three different resolutions with various bitrates.
Encoding at three resolutions and various bitrates. Blue markers depict encoding points and the red curve indicates the PSNR-bitrate convex hull.

At each resolution, the quality of the encode monotonically increases with the bitrate, but the curve starts flattening out (A and B) when the bitrate goes above some threshold. This is because every resolution has an upper limit in the perceptual quality it can produce. When a video gets downsampled to a low resolution for encoding and later upsampled to full resolution for display, its high frequency components get lost in the process.

On the other hand, a high-resolution encode may produce lower quality than an encode at the same bitrate but at a lower resolution (see C and D). This is because encoding more pixels with lower precision can produce a worse picture than encoding fewer pixels at higher precision combined with upsampling and interpolation. Furthermore, at very low bitrates the encoding overhead associated with every fixed-size coding block starts to dominate the bitrate consumption, leaving very few bits for encoding the actual signal.  Encoding at high resolution with an insufficient bitrate would produce artifacts such as blocking, ringing and contouring.

Based on the discussion above, we can draw a conceptual plot to depict the bitrate-quality relationship for any video source encoded at different resolutions, as shown below:

We can see that each resolution has a bitrate region in which it outperforms other resolutions. If we collect all these regions from all the resolutions available, they collectively form a boundary called the convex hull. In an economic sense, the convex hull is where the encoding point achieves Pareto efficiency. Ideally, we want to operate exactly at the convex hull, but due to practical constraints (for example, we can only select from a finite number of resolutions), we would like to select bitrate-resolution pairs that are as close to the convex hull as possible.

It is practically infeasible to construct the full bitrate-quality graphs spanning the entire quality region for each title in our catalogue. To implement a practical solution in production, we perform trial encodings at different quantization parameters (QPs), over a finite set of resolutions. The QPs are chosen such that they are one JND apart. For each trial encode, we measure the bitrate and quality. By interpolating curves based on the sample points, we produce bitrate-quality curves at each candidate resolution. The final per-title bitrate ladder is then derived by selecting points closest to the convex hull.
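To make the last step concrete, here is a minimal Python sketch that interpolates per-resolution bitrate-quality curves from trial encodes and, for each candidate bitrate, picks the resolution closest to the convex hull. The sample points and candidate bitrates are hypothetical placeholders; the production pipeline chooses QPs one JND apart and works over a much denser bitrate grid.

```python
import numpy as np

# Hypothetical trial-encode measurements per resolution: (bitrate kbps, PSNR dB),
# sampled at a handful of QPs. Real values come from the trial encodes.
trials = {
    "1920x1080": [(2000, 38.0), (3000, 40.5), (4500, 42.8), (5800, 44.0)],
    "1280x720":  [(1000, 37.5), (1750, 40.0), (2350, 41.5), (3000, 42.0)],
    "720x480":   [(400, 35.0),  (750, 37.8),  (1050, 39.0), (1750, 39.6)],
}

def quality_at(points, bitrate):
    """Linearly interpolate quality at a bitrate; ignore resolutions whose
    trial encodes do not cover that bitrate."""
    rates, psnrs = zip(*sorted(points))
    if not rates[0] <= bitrate <= rates[-1]:
        return float("-inf")
    return float(np.interp(bitrate, rates, psnrs))

def best_resolution(bitrate):
    """The resolution whose interpolated curve is highest at this bitrate,
    i.e. the point closest to the convex hull."""
    return max(trials, key=lambda res: quality_at(trials[res], bitrate))

# Candidate ladder bitrates (a real ladder uses steps of roughly 5%).
for kbps in [750, 1500, 2350, 3000, 4300, 5800]:
    res = best_resolution(kbps)
    print(kbps, res, round(quality_at(trials[res], kbps), 1))
```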
Sample Results
BoJack Horseman is an example of an animation with simple content - flat regions and low motion from frame to frame. In the fixed bitrate ladder scheme, we use 1750 kbps for the 480p encode. For this particular episode, with the per-title recipe we start streaming 1080p video at 1540 kbps. Below we compare cropped screenshots (assuming a 1080p display) from the two versions (top: 1750 kbps, bottom: new 1540 kbps). The new encode is crisper and has better visual quality.
[Screenshots: 1750 kbps fixed-ladder 480p encode (top) vs. 1540 kbps per-title 1080p encode (bottom)]

Orange is the New Black has video characteristics of more average complexity. At the low bitrate range, there is no significant quality improvement seen with the new scheme. At the high end, the new per-title encoding assigns 4640 kbps for the highest quality 1080p encode. This is a 20% bitrate savings compared to 5800 kbps for the fixed ladder scheme. For this title we avoid wasting bits but maintain the same excellent visual quality for our members. The images below show a screenshot at 5800 kbps (top) vs. 4640 kbps (bottom).

[Screenshots: 5800 kbps fixed-ladder encode (top) vs. 4640 kbps per-title encode (bottom)]
The Best Recipe for Your Device
In the description above where we select the optimized per-title bitrate ladder, there is an inherent assumption that the viewing device can receive and play any of the encoded resolutions. However, because of hardware constraints, some devices may be limited to resolutions lower than the original resolution of the source content. If we select the convex hull covering resolutions up to 1080p, this could lead to suboptimal viewing experiences for, say, a tablet limited to 720p decoding hardware. For example, given an animation title, we may switch to 1080p at 2000 kbps because it results in better quality than a 2000 kbps 720p stream. However the tablet will not be able to utilize the 1080p encode and would be constrained to a sub-2000 kbps stream even if the bandwidth allows for a better quality 720p encode.
To remedy this, we design additional per-title bitrate ladders corresponding to the maximum playable resolution on the device. More specifically, we design additional optimal per-title bitrate ladders tailored to 480p and 720p-capped devices.  While these extra encodes reduce the overall storage efficiency for the title, adding them ensures that our customers have the best experience.
What does this mean for my Netflix shows?
Per-title encoding allows us to deliver higher quality video in two ways: under low-bandwidth conditions, per-title encoding will often give you better video quality, as titles with “simple” content, such as BoJack Horseman, will now be streamed at a higher resolution for the same bitrate. When the available bandwidth is adequate for high bitrate encodes, per-title encoding will often give you even better video quality for complex titles, such as Marvel's Daredevil, because we will encode at a higher maximum bitrate than our current recipe. Our continuous innovation on this front recognizes the importance of providing an optimal viewing experience for our members while simultaneously using less bandwidth and being better stewards of the Internet.


by Anne Aaron, Zhi Li, Megha Manohara, Jan De Cock and David Ronca

An update to our Windows 10 app

We have published a new version of our Windows 10 app to the Windows Store. This update features an updated user experience that is powered by an entirely new implementation on the Universal Windows Platform.

The New User Experience

The updated Browse experience provides vertical scrolling through categories and horizontal scrolling through the items in a category.


The updated Details view features large, cinematic artwork for the show or movie. The Details view for a Show includes episode information while the Details view for a Movie includes suggestions for other content.


Our members on Windows run across many different screen sizes, resolutions and scaling factors. The new version of the application uses a responsive layout to optimize the size and placement of items based on the window size and scaling factor.

Since many Windows 10 devices support touch input on integrated displays or via gestures on their trackpad, we have included affordances for both in this update. When a member is browsing content with a mouse, paginated scrolling of the content within a row is enabled by buttons on the ends of the row. When a member is using touch on an integrated display or via gestures on their trackpad, inertial scrolling of rows is enabled via swipe gestures.

Using the Universal Windows Platform

Over the last few years we have launched several applications for Windows and Windows Phone. The applications were built from a few code bases that span several technologies including Silverlight, XAML, and C#. Bringing new features to our members on Windows platforms has required us to make changes in several code bases and ship multiple application updates.

With the Universal Windows Platform, we’re able to build an application from a single code base and run on many Windows 10 devices. Although the initial release of this application supports desktops, laptops and tablets running Windows 10, we have run our application on other Windows 10 devices and we will be adding support for phones running Windows 10 in the near future.

This new version of the application is a JavaScript-based implementation that utilizes Microsoft’s WinJS library. Like several other teams at Netflix (see the tech blog posts Netflix Likes React and Making Netflix.com Faster), we chose to use Facebook’s React. Using JavaScript to build the application has allowed us to use the same HTML5 video playback engine that is used in our browser-based applications.

Windows Features

The new version of the app continues to support two features that are unique to the Windows platform.

When the app is pinned to the Start menu, the application tile is a Live Tile that will show artwork representing the items in a member’s Continue Watching list. We support several Live Tile sizes, and the large tile size is a new addition in this version of the app.

Our integration with Cortana enables members to search with voice commands. On app start, we register our supported commands with Cortana. Once a member has signed in, they can issue one of our supported commands to Cortana. Here are the Cortana commands that we support in English:


If a user were to tell Cortana: Netflix find Jessica Jones, Cortana would start the app (if needed) and perform a search for Jessica Jones.

What’s Next?

We’re excited to share this update with our members and we’re hard at work on a new set of features and enhancements. The Universal Windows Platform will enable us to support phones running Windows 10 in the near future.

Visit the Windows Store to get the app today!

by Sean Sharma

HTML5 Video is now supported in Firefox

Today we’re excited to announce the availability of our HTML5 player in Firefox! Windows support is rolling out this week, and OS X support will roll out next year.

Firefox ships with the very latest versions of the HTML5 Premium Video Extensions. That includes the Media Source Extensions (MSE), which enable our video streaming algorithms to adapt to your available bandwidth; the Encrypted Media Extensions (EME), which allows for the viewing of protected content; and the Web Cryptography API (WebCrypto), which implements the cryptographic functions used by our open source Message Security Layer client-server protocol.

We worked closely with Mozilla and Adobe throughout development. Adobe supplies a content decryption module (CDM) that powers the EME API and allows protected content to play. We were pleased to find through our joint field testing that Adobe Primetime's CDM, Mozilla’s <video> tag, and our player all work together seamlessly to provide a high quality viewing experience in Firefox. With the new Premium Video Extensions, Firefox users will no longer need to take the extra step of installing a plug-in to watch Netflix.

We’re gratified that our HTML5 player support now extends to the latest versions of all major browsers, including Firefox, IE, Edge, Safari, and Chrome. Upgrade today to the latest version of your browser to get our best-in-class playback experience.

Dynomite with Redis on AWS - Benchmarks

About a year ago the Cloud Database Engineering (CDE) team published a post Introducing Dynomite. Dynomite is a proxy layer that provides sharding and replication and can turn existing non-distributed datastores into a fully distributed system with multi-region replication. One of its core features is the ability to scale a data store linearly to meet rapidly increasing traffic demands. Dynomite also provides high availability, and was designed and built to support Active-Active Multi-Regional Resiliency.
Dynomite with Redis is now utilized as a production system within Netflix. This post is the first in a series that pragmatically examines Dynomite's use cases and features. In this post, we will show performance results using Amazon Web Services (AWS) with and without the recently added consistency feature.

Dynomite Consistency

Dynomite extends eventual consistency to tunable consistency in the local region. The consistency level specifies how many replicas must respond to a write or read request before returning data to the client application. Read and write consistency can be configured to manage availability versus data accuracy. Consistency can be configured for read or write operations separately (cluster-wide). There are two configurations:
  • DC_ONE: Reads and writes are propagated synchronously only to the token owner in the local Availability Zone (AZ) and asynchronously replicated to other AZs and regions.
  • DC_QUORUM: Reads and writes are propagated synchronously to a quorum number of nodes in the local region and asynchronously to the rest. A quorum is calculated by the formula ceiling((n+1)/2), where n is the number of nodes in a region; the operation succeeds if the read/write succeeded on a quorum number of nodes (a small sketch of this arithmetic follows the list).
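As a quick illustration of that arithmetic, a few lines of Python (the node counts are just examples):

```python
import math

def quorum(n):
    """Quorum size for n nodes in a region: ceiling((n + 1) / 2)."""
    return math.ceil((n + 1) / 2)

# For the 3-node cluster used in these tests (one node per AZ), quorum(3) == 2,
# so a DC_QUORUM read or write succeeds once 2 of the 3 replicas respond.
print([quorum(n) for n in (3, 6, 12, 24, 48)])  # [2, 4, 7, 13, 25]
```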

Test Setup

Client (workload generator) Cluster

For the workload generator, we used an internal Netflix tool called Pappy. Pappy is well integrated with other Netflix OSS services (Archaius for fast properties, Servo for metrics, and Eureka for discovery). However, any other distributed load generator with a Redis client plugin can be used to replicate the results. Pappy has support for modules, and one of them is the Dyno Java client.
The Dyno client uses topology aware load balancing (Token Aware) to directly connect to the Dynomite coordinator node that owns the specified data. Dyno also uses zone awareness to send traffic to Dynomite nodes in the local ASG. To get the full benefit of a Dynomite cluster, a) the Dyno client cluster should be deployed across all ASGs, so all nodes can receive client traffic, and b) the number of client application nodes per ASG must be larger than the corresponding number of Dynomite nodes in the respective ASG, so that the cumulative network capacity of the client cluster is at least equal to the corresponding capacity at the Dynomite layer.
Dyno also uses connection pooling with persistent connections to reduce connection churn to the Dynomite nodes. However, in performance benchmarks tuning Dyno can be tricky, as the workload generator may become the bottleneck due to thread contention. In our benchmark, we monitored the metrics that Dyno exposes for the delay in picking up a connection from the connection pool.
  • Client: Dyno Java client, using default configuration (token aware + zone aware)
  • Number of nodes: Equal to the number of Dynomite nodes in each experiment.
  • Region: us-west-2 (us-west-2a, us-west-2b and us-west-2c)
  • EC2 instance type: m3.2xlarge (30GB RAM, 8 CPU cores, Network throughput: high)
  • Platform: EC2-Classic
  • Data size: 1024 Bytes
  • Number of Keys: 1M random keys
  • The demo application used a simple workload of just key-value pairs for reads and writes, i.e. the Redis GET and SET APIs (a minimal stand-in sketch follows this list).
  • Read/Write ratio: 80:20 (the OPS was variable per test, but the ratio was kept 80:20)
  • Number of readers/writers: 80:20 ratio of reader to writer threads, i.e. 32 readers/8 writers per Dynomite node. We performed some experiments varying the number of readers and writers and found that, in the context of our experiments, 32 readers and 8 writers per Dynomite node gave the best throughput-latency tradeoff.
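Pappy and the Dyno client are Netflix-internal tools, so as a rough stand-in here is a minimal Python sketch of the same kind of workload using the generic redis-py client against a single Dynomite node. The host, port, key space, and iteration count are placeholders.

```python
import random
import redis  # generic Redis client; the actual benchmark used the Dyno Java client

r = redis.Redis(host="dynomite-node.example.com", port=8102)  # placeholder endpoint
payload = b"x" * 1024          # 1 KB values
NUM_KEYS = 1_000_000           # 1M random keys

def one_operation():
    key = f"key:{random.randrange(NUM_KEYS)}"
    if random.random() < 0.8:  # 80% reads
        r.get(key)
    else:                      # 20% writes
        r.set(key, payload)

for _ in range(10_000):
    one_operation()
```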

Dynomite Cluster

  • Dynomite: Dynomite 0.5.6
  • Data store: Redis 2.8.9
  • Operating system: Ubuntu 14.04.2 LTS
  • Number of nodes: 3-48 (doubling every time)
  • Region: us-west-2 (us-west-2a, us-west-2b and us-west-2c)
  • EC2 instance type: r3.2xlarge (61GB RAM, 8 CPU cores, Network throughput: high)
  • Platform: EC2-Classic
The test was performed in a single region with 3 availability zones. Note that replicating to two other availability zones is not a requirement for Dynomite, but rather a deployment choice for high availability at Netflix. A Dynomite cluster of 3 nodes means that there was 1 node per availability zone. However all three nodes take client traffic as well as replication traffic from peer nodes. Our results were captured using Atlas. Each experiment was run 3 times, each run lasted 3 hours, and our results are averaged over these runs.
For our benchmarks, we refer to the Dynomite node as the node that contains both the Dynomite layer and Redis. Hence we do not distinguish whether the Dynomite layer or Redis contributes to the average latency.

Linear Scale Test with DC_ONE

The graphs indicate that Dynomite can scale horizontally in terms of throughput, so it can handle more traffic by increasing the number of nodes per region. With r3.2xlarge nodes, Dynomite can fully process the traffic generated by the client workload generator (i.e. 32K read OPS and 8K write OPS on a per node basis). For a 1KB payload, as the above graphs show, the main bottleneck is the network (1Gbps EC2 instances). Therefore, Dynomite could potentially provide even higher throughput if r3.4xlarge (2Gbps) or r3.8xlarge (10Gbps) EC2 instances were used. However, we need to note that 10Gbps optimizations will only be effective when the instances are launched in Amazon's VPC with instance types that support Enhanced Networking using single root I/O virtualization (SR-IOV).
The average and median latency values show that Dynomite can provide sub-millisecond average latency to the client application. More specifically, Dynomite does not add extra latency as it scales to higher number of nodes, and therefore higher throughput. Overall, the Dynomite node contributes around 20% of the average latency, and the rest of it is a result of the network latency and client processing latency.
At the 95th percentile Dynomite's latency is 0.4ms and does not increase as we scale the cluster up/down. More specifically, the network and client are the major contributors to the 95th percentile latency, as the Dynomite node's contribution is <10%.
It is evident from the 99th percentile graph that the latency for Dynomite remains pretty much the same while the client-side latency increases, indicating the variable nature of the network between the clusters.

Linear Scale Test With DC_QUORUM

The test setup was similar to what we used for DC_ONE tests above. Consistency was set to DC_QUORUM for both reads and writes on all Dynomite nodes. In DC_QUORUM, our expectations are that throughput will reduce and latency will increase because Dynomite waits for quorum number of responses.
Looking at the above graph, it is clear that Dynomite still scales well as cluster nodes are added. Moreover, Dynomite achieves 18K OPS per node in our setup when the cluster spans a single region. In comparison, Dynomite can achieve 40K OPS per node in DC_ONE.
The average and median latency remain <2.5ms even when DC_QUORUM consistency is enabled on the Dynomite nodes, and are only slightly higher than in the corresponding experiments with DC_ONE. In DC_ONE, the Dynomite coordinator only waits for the local zone node to respond. In DC_QUORUM, the coordinator waits for a quorum of nodes to respond. Hence, the overall read and write latency must also include the network hop to the other ASGs and the latency of performing the corresponding operation on the nodes in those ASGs.
The 95th percentile at the Dynomite level is less than 2ms even after increasing the traffic on each Dyno client node (linear scale). At the client side it remains below 3ms.
At the 99th percentile with DC_QUORUM enabled, Dynomite produces less than 3ms of latency. When considering the network from the cluster to the client, the latency remains well below 5ms, opening the door for a number of applications that require consistency with low latency. Dynomite can report the 90th and 99.9th percentiles through its statistics port; for brevity, we present the 99th percentile only.

Pipelining

Redis Pipelining is client side batching that is also supported by Dynomite; the client application sends requests without waiting for a response from a previous request and later reads a single response for the whole batch. Pipelining makes it possible to increase the overall throughput at the expense of additional latency for individual operations. In the following experiments, the Dyno client randomly selected between 3 to 10 operations in one pipeline request. We believe that this configuration might be close to how a client application would use the Redis Pipelining. The experiments were performed for both DC_ONE and DC_QUORUM.
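As an illustration of the batching described above, here is a minimal redis-py sketch that groups 3 to 10 randomly chosen GET/SET operations into one pipeline request. The real experiments used the Dyno Java client, so this is only an approximation of the workload.

```python
import random
import redis

r = redis.Redis(host="dynomite-node.example.com", port=8102)  # placeholder endpoint
payload = b"x" * 1024

def pipelined_batch():
    # Batch between 3 and 10 operations per pipeline request, mirroring the
    # randomized batch sizes used in the experiments above.
    pipe = r.pipeline(transaction=False)       # plain pipelining, no MULTI/EXEC
    for _ in range(random.randint(3, 10)):
        key = f"key:{random.randrange(1_000_000)}"
        if random.random() < 0.8:
            pipe.get(key)
        else:
            pipe.set(key, payload)
    return pipe.execute()                      # one round trip for the whole batch
```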
For comparison, we showcase both the non-pipelined and pipelined results. In our tests, pipelining increased the throughput by up to 50%. For a small Dynomite cluster the improvement is larger, but as Dynomite scales horizontally the benefit of pipelining decreases.
Latency depends on how many requests are combined into one pipeline request, so it will vary and will be higher than for non-pipelined requests.

Conclusion

We performed the tests to get more insight into Dynomite using Redis at the data store layer, and into how to size our clusters. We could have achieved better results with better instance types for both the client and the Dynomite server cluster. For example, adding Dynomite nodes with better network capacity (especially the ones supporting Enhanced Networking on Linux instances in a VPC) could further increase the performance of our clusters.
Another way to improve the performance is by using fewer availability zones. In that case, Dynomite would replicate the data to one additional availability zone instead of two, so more bandwidth would be available for client connections. In our experiments we used 3 availability zones in us-west-2, which is a common deployment in most production clusters at Netflix.
In summary, our benchmarks were based on instance types and deployments that are common for Dynomite at Netflix. We presented results that indicate that DC_QUORUM provides better read and write guarantees to the client, but with higher latencies and lower throughput. We also showcased how a client can configure Redis pipelining and benefit from request batching.
We briefly mentioned the availability of higher consistency in this article. In the next article we'll dive deeper into how we implemented higher consistency and how we handle anti-entropy.
by: Shailesh Birari, Jason Cacciatore, Minh Do, Ioannis Papapanagiotou, Christos Kalantzis

Automated Failure Testing

AKA Training Smarter Monkeys

At Netflix, we have found that proactive failure testing is a great way to ensure that we have a reliable product for our members by helping us prepare our systems, and our teams, for the problems that arise in our production environment. Our various efforts in this space, some of which are manual, have helped us make it through the holiday season without incident (which is great if you’re on-call for New Year’s Eve!). But who likes manual processes? Additionally, we are only testing for the failures we anticipate, and often only for an individual service or component per exercise. We can do better!

Imagine a monkey that crawls through your code and infrastructure, injecting small failures and discovering if it results in member pain.

While looking for a way to build such a monkey, we discovered a failure testing approach developed by Peter Alvaro called Molly. Given that we already had a failure injection service called FIT, we believed we could build a prototype implementation in short order. And we thought it would be great to see how well the concepts outlined in the Molly paper translated into a large-scale production environment.  So, we got in touch with Peter to see if he was interested in working together to build a prototype. He was, and the results of our collaboration are detailed below.

Algorithm


“A lineage-driven fault injector reasons backwards from correct system outcomes to determine whether failures in the execution could have prevented the outcome.” [1]

Molly begins by looking at everything that went into a successful request and asking “What could have prevented this outcome?” Take this simplified request as an example:
[Figure: simplified request graph]

(A or R or P or B)

At the start, everything is necessary - as far as we know. Symbolically we say that member pain could result from failing (A or R or P or B) where A stands for API, etc. We start by choosing randomly from the potential failure points and rerunning the request, injecting failure at the chosen point.

There are three potential outcomes:
  1. The request fails - we’ve found a member facing failure
    1. From this we can prune future experiments containing this failure
  2. The request succeeds - the service/failure point is not critical
  3. The request succeeds, and there is an alternative interaction that takes the place of the failure (i.e. a failover or a fallback).  

In this example, we fail Ratings and the request succeeds, producing this graph:

[Figure: request graph after failing Ratings]
(A or P or B) and (A or P or B or R)

We know more about this request’s behavior and update our failure equation. As Playlist is a potential failure point in this equation, we’ll fail it next, producing this graph:

[Figure: request graph after failing Playlist, showing its fallback]

(A or PF or B) and (A or P or B) and (A or P or B or R)

This illustrates #3 above. The request was still successful, but due to an alternate execution. Now we have a new failure point to explore. We update our equation to include this new information. Now we rinse, lather, and repeat until there are no more failures to explore.

Molly isn’t prescriptive on how to explore this search space. For our implementation we decided to compute all solutions which satisfy the failure equation, and then choose randomly from the smallest solution sets. For example, the solutions to our last representation would be: [{A}, {PF}, {B}, {P,PF}, {R,A}, {R,B} …]. We would begin by exploring all the single points of failure: A, PF, B; then proceed to all sets of size 2, and so forth.
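As a simplified sketch of this smallest-first exploration, the snippet below treats a candidate solution as a set of failure points that intersects every clause of the failure equation (a hitting-set simplification of the lineage-driven approach) and prunes candidates that contain an already-discovered solution. The clauses are the hypothetical ones from the example above.

```python
from itertools import combinations

# Failure points discovered so far, and the clauses of the failure equation:
# each clause is a set of points, at least one of which must fail.
points = {"A", "P", "PF", "B", "R"}
clauses = [{"A", "PF", "B"}, {"A", "P", "B"}, {"A", "P", "B", "R"}]

def satisfies(failure_set):
    """True if failing this set of points breaks every known support (clause)."""
    return all(clause & failure_set for clause in clauses)

def candidates_smallest_first():
    """Yield candidate failure sets in order of increasing size, pruning any
    candidate that contains a smaller set already known to satisfy the equation."""
    found = []
    for size in range(1, len(points) + 1):
        for combo in combinations(sorted(points), size):
            candidate = set(combo)
            if any(prev <= candidate for prev in found):
                continue  # already covered by a smaller experiment
            if satisfies(candidate):
                found.append(candidate)
                yield candidate

# In the real system each candidate becomes a FIT experiment, and its outcome
# (member pain, or a newly discovered fallback) updates the equation.
print([sorted(c) for c in candidates_smallest_first()])  # [['A'], ['B'], ['P', 'PF']]
```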

Implementation


Lineage


What is the lineage of a Netflix request? We are able to leverage our tracing system to build a tree of the request execution across our microservices. Thanks to FIT, we have additional information in the form of “Injection Points”. These are key inflection points in our system where failures may occur. Injection Points include things like Hystrix command executions, cache lookups, DB queries, HTTP calls, etc. The data  provided by FIT allows us to build a more complete request tree, which is what we feed into the algorithm for analysis.

In the examples above, we see simple service request trees. Here is the same request tree extended with FIT data:

[Figure: request tree extended with FIT injection points]

Success

What do we mean by ‘success’? What is most important is our members’ experience, so we want a measurement that reflects this. To accomplish this, we tap into our device reported metrics stream. By analyzing these metrics we can determine if the request resulted in a member-facing error.  

An alternate, more simplistic approach could be to rely on the HTTP status codes for determining successful outcomes. But status codes can be misleading, as some frameworks return a ‘200’ on partial success, with a member-impacting error embedded within the payload.

Currently only a subset of Netflix requests have corresponding device reported metrics. Adding device reported metrics for more request types presents us with the opportunity to expand our automated failure testing to cover a broader set of device traffic.

Idempotence


Being able to replay requests made things nice and clean for Molly. We don’t have that luxury. We don’t know at the time we receive a request whether or not it is idempotent and safe to replay. To offset this, we have grouped requests into equivalence classes, such that requests within each class ‘behave’ the same - i.e. execute the same dependent calls and fail in the same way.

To define request classes, we focused on the information we had available when we received the request: the path (netflix.com/foo/bar), the parameters (?baz=boo), and the device making the request. Our first pass was to see if a direct mapping existed between these request features and the set of dependencies executed. This didn’t pan out. Next we explored using machine learning to find and create these mappings. This seemed promising, but would require a fair amount of work to get right.

Instead, we narrowed our scope to only examine requests generated by the Falcor framework. These requests provide, through the query parameters, a set of json paths to load for the request, i.e. ‘videos’, ‘profiles’, ‘images’. We found that these Falcor path elements matched consistently with the internal services required to load those elements.

Future work involves finding a more generic way to create these request class mappings so that we can expand our testing beyond Falcor requests.

These request classes change as code is written and deployed by Netflix engineers. To offset this drift, we run an analysis of potential request classes daily through a sampling of the device reported metrics stream. We expire old classes that no longer receive traffic, and we create new classes for new code paths.

Member Pain


Remember that the goal of this exploration is to find and fix errors before they impact a large number of members. It’s not acceptable to cause a lot of member pain while running our tests. In order to mitigate this risk, we structure our exploration so that we are only running a small number of experiments over any given period.

Each experiment is scoped to a request class and runs for a short period (twenty to thirty seconds) for a miniscule percentage of members. We want at least ten good example requests from each experiment. In order to filter out false positives, we look at the overall success rate for an experiment, only marking a failure as found if greater than 75% of requests failed. Since our request class mapping isn’t perfect, we also filter out requests which, for any reason, didn’t execute the failure we intended to test.  

Let’s say we are able to run 500 experiments in a day. If we are potentially impacting 10 members each run, then the worst case impact is 5,000 members each day. But not every experiment results in a failure - in fact the majority of them result in success. If we only find a failure in one in ten experiments (a high estimate), then we’re actually impacting 500 members’ requests in a day, some of which are further mitigated by retries. When you’re serving billions of requests each day, the impact of these experiments is very small.

Results


We were lucky that one of the most important Netflix requests met our criteria for exploration - the ‘App Boot’ request. This request loads the metadata needed to run the Netflix application and load the initial list of videos for a member. This is a moment of truth that, as a company, we want to win by providing a reliable experience from the very start.

This is also a very complex request, touching dozens of internal services and hundreds of potential failure points. Brute force exploration of this space would take 2^100 iterations (roughly 1 with 30 zeros following), whereas our approach was able to explore it in ~200 experiments. We found five potential failures, one of which was a combination of failure points.

What do we do once we’ve found a failure?  Well, that part is still admittedly manual. We aren’t to the point of automatically fixing the failure yet. In this case, we have a list of known failure points, along with a ‘scenario’ which allows someone to use FIT to reproduce the failure. From this we can verify the failure and decide on a fix.

We’re very excited that we were able to build this proof of concept implementation and find real failures using it. We hope to be able to extend it to search a larger portion of the Netflix request space and find more member facing failures before they result in outages, all in an automated way.

And if you’re interested in failure testing and building resilient systems, get in touch with us - we’re hiring!

Kolton Andrus (@KoltonAndrus), Ben Schmaus (@schmaus)

Astyanax - Retiring an old friend

In the summer of 2011, Astyanax, an Apache Cassandra (C*) Java client library, was created to make it easy to consume Cassandra, which at the time was in its infancy. Astyanax became so popular that, for a good while, it was the de facto Java client library for the Apache Cassandra community. Astyanax provides the following features:
  • High level, simple, object oriented interface to Cassandra.
  • Resilient behavior on the client side.
  • Connection pool abstraction. Implementation of a round robin and token-aware connection pool.
  • Monitoring abstraction to get event notification from the connection pool.
  • Complete encapsulation of the underlying Thrift API and structs.
  • Automatic retry of downed hosts.
  • Automatic discovery of additional hosts in the cluster.
  • Suspension of hosts for a short period of time after several timeouts.
  • Annotations to simplify use of composite columns.
Datastax, the enterprise company behind Apache Cassandra, took many of the lessons contained within Astyanax and included them within their official Java Cassandra driver.
When Astyanax was written, the protocol to communicate to Cassandra was Thrift and the API was very low level. Today, Cassandra is mostly consumed via a query language very similar to SQL. This new language is called CQL (Cassandra Query Language). The Cassandra community has also moved beyond the Thrift protocol to the CQL BINARY PROTOCOL.
Thrift will be deprecated in Apache Cassandra in version 4.0. Aside from that deprecation there are also the following reasons to move away from Thrift:
  • CQL Binary protocol performs better
  • Community development efforts have completely moved to the CQL Binary protocol. The thrift implementation is only in maintenance mode.
  • CQL is easier to consume since the API resembles SQL.
Today we are moving Astyanax from an active project in the NetflixOSS ecosystem into an archived state. This means the project will still be available for public consumption; however, we will not be making any feature enhancements or performance improvements. There are still tens of thousands (if not more) of lines of code within Netflix that use Astyanax. Moving forward, we will only be fixing Netflix-critical bugs as we begin our efforts to refactor our internal systems to use the CQL Binary protocol.
If there are members of the community that would like to have a more hands-on role and maintain the project by becoming a committer, please reach out to me directly.

Distributed Time Travel for Feature Generation


We want to make it easy for Netflix members to find great content to fulfill their unique tastes. To do this, we follow a data-driven algorithmic approach based on machine learning, which we have described in past posts and other publications. We aspire to a day when anyone can sit down, turn on Netflix, and the absolute best content for them will automatically start playing. While we are still way off from that goal, it sets a vision for us to improve the algorithms that span our service: from how we rank videos to how we construct the homepage to how we provide search results. To make our algorithms better, we follow a two-step approach. First, we try an idea offline using historical data to see if it would have made better recommendations. If it does, we then deploy a live A/B test to see if it performs well in reality, which we measure through statistically significant improvements in core metrics such as member engagement, satisfaction, and retention.

While there are many ways to improve machine learning approaches, arguably the most critical is to provide better input data. A model can only be as good as the data we give it. Thus, we spend a lot of time experimenting with new kinds of input signals for our models. Most machine learning models expect input to be represented as a vector of numbers, known as a feature vector. Somehow we need to take an arbitrary input entity (e.g. a tuple of member profile, video, country, time, device, etc.), with its associated, richly structured data, and provide a feature vector representing that entity for a machine learning algorithm to use. We call this transformation feature generation and it is central to providing the data needed for learning. Example features include how many minutes a member has watched a video, the popularity of the video, its predicted rating, what genre a video belongs to, or how many videos are in a row. We use the term feature broadly, since a feature could be a simple indicator or have a full model behind it, such as a Matrix Factorization.

We will describe how we built a time machine for feature generation using Apache Spark that enables our researchers to quickly try ideas for new features on historical data such that running offline experiments and transitioning to online A/B tests is seamless.

Why build a time machine?

There are many ways to approach feature generation, several of which we’ve used in the past. One way is to use logged event data that we store on S3 and access via Hive by running queries on these tables to define features. While this is flexible for exploratory analysis, it has several problems. First, to run an A/B test we need the feature calculation to run within our online microservice architecture. We run the models online because we know that freshness and responsiveness of our recommendations is important to the member experience. This means we would need to re-implement feature generation to retrieve data from online services instead of Hive tables. It is difficult to match two such implementations exactly, especially since any discrepancies between offline and online data sources can create unexpected differences in the model output. In addition, not all of our data is available offline, particularly output of recommendation models, because these involve a sparse-to-dense conversion that creates a large volume of data.

On the other extreme, we could log our features online where a model would be used. While this removes the offline/online discrepancy and makes transitioning to an A/B test easy, it means we need to deploy each idea for a new feature into production and wait for the data to collect before we can determine if a feature is useful. This slows down the iteration cycle for new ideas. It also requires that all the data for a feature be available online, which could mean building new systems to serve that data, again before we have determined if it is valuable. We also need to compute features for many more members or requests than we may actually need for training, based on how we choose label data.

We’ve also tried a middle ground where we use feature code that calls online services, such as the one that provides viewing history, and filters out all the data with timestamps past a certain point in time. However, this only works for situations where a service records a log of all historical events; services that just provide the current state cannot be used. It also places additional load on the online services each time we generate features.

Throughout these approaches, management of time is extremely important. We want an approach that balances the benefits of all the above approaches without the drawbacks. In particular, we want a system that:
  • Enables quick iteration from idea to modeling to running an A/B test
  • Uses the data provided by our online microservices, without overloading them
  • Accurately represents input data for a model at a point in time to simulate online use
  • Handles our scale of data with many researchers running experiments concurrently, without using more than 1.21 gigawatts of power
  • Works well in an interactive environment, such as using a notebook for experimentation, and also reliably in a batch environment, such as for doing periodic retraining
  • Should only need to write feature code once so that we don’t need to spend time verifying that two implementations are exactly equivalent
  • Most importantly, no paradoxes are allowed (e.g. the label can’t be in the features)
    When faced with tough problems one often wishes for a time machine to solve them. So that is what we decided to build. Our time machine snapshots online services and uses the snapshot data offline to reconstruct the inputs that a model would have seen online to generate features. Thus, when experimenters design new feature encoders — functions that take raw data as input and compute features — they can immediately use them to compute new features for any time in the past, since the time machine can retrieve the appropriate snapshots and pass them to the feature encoders.

    How to build a Time Machine

    Here are the various components needed in a time machine that snapshots online services:
    • Select contexts to snapshot
    • Snapshot data of various micro services for the selected context
    • Build APIs to serve this data for a given time coordinate in the past

      Context Selection

      Snapshotting data for all contexts (e.g all member profiles, devices, times of day) would be very expensive. Instead, we select samples of contexts to snapshot periodically (typically daily), though different algorithms may need to train on different distributions. For example, some use  stratified samples based on properties such as viewing patterns, devices, time spent on the service, region, etc. To handle this, we use Spark SQL to select an appropriate sample of contexts for each experiment from Hive. We merge the context set across experiments and persist it into S3 along with the corresponding experiment identifiers.
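As a rough sketch of what that selection step can look like with Spark SQL (the table, columns, sampling rates, bucket, and experiment identifier below are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical Hive table of member profiles with a device_type column.
profiles = spark.sql("""
    SELECT profile_id, country, device_type
    FROM dse.member_profiles
    WHERE dateint = 20151201
""")

# A simple stratified sample: different sampling rates per device type.
sampled = (profiles
           .sampleBy("device_type",
                     fractions={"tv": 0.01, "mobile": 0.02, "browser": 0.01},
                     seed=42)
           .withColumn("experiment_id", F.lit("exp_homepage_ranking_v1")))

# Persist the selected contexts to S3 for the snapshotting jobs to pick up.
sampled.write.mode("overwrite").parquet("s3://some-bucket/contexts/20151201/")
```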

      Data Snapshots

      The next component in the time machine fetches data from various online services and saves a snapshot of the returned data for the selected contexts. Netflix embraces a fine-grained Service Oriented Architecture for our cloud-based deployment model. There are hundreds of such micro services that are collectively responsible for handling the member experience. Data from various such services like Viewing History, My List, and Predicted Ratings are used as input for the features in our models.

      We use Netflix-specific components such as Eureka, Hystrix, and Archaius to fetch data from online services through their client libraries. However, some of these client libraries bulk-load data, so they have a high memory footprint and a large startup time. Spark is not well suited for loading such components inside its JVM. Moreover, the requirement of creating an uber jar to run Spark jobs can cause runtime jar incompatibility issues with other Netflix libraries. To alleviate this problem, we used Prana, which runs outside the Spark JVM, as a data proxy to the Netflix ecosystem.

      Spark parallelizes the calls to Prana, which internally fetches data from various micro services for each of these contexts. We chose Thrift as the binary communication protocol between Spark and Prana. We store the snapshotted data in S3 using Parquet, a compressed column-oriented binary format, for both time and space efficiency, and persist the location of the S3 data in Cassandra.
      Ensuring pristine data quality of these snapshots is critical for us to correctly evaluate our models. Hence, we store the confidence level for each snapshot service, which is the percentage of successful data fetches from the micro services excluding any fallbacks due to timeouts or service failures. We expose it to our clients, who can choose to use this information for their experimentation.

      For both snapshotting and context selection, we needed to schedule several Spark jobs to run on a periodic basis, with dependencies between them. To that end, we built a general purpose workflow orchestration and scheduling framework called Meson, which is optimized for machine learning pipelines, and used it to run the Spark jobs for the components of the time machine. We intend to open source Meson in the future and will provide more detail about it in an upcoming blog post.

      APIs for Time Travel

      We built APIs that enable time travel and fetch the snapshot data from S3 for a given time in the past. Here is a sample API to get the snapshot data for the Viewing History service.






      Given a destination time in the past, the API fetches the associated S3 location of the snapshot data from Cassandra and loads the snapshot data in Spark. In addition, when given an A/B test identifier, the API filters the snapshot data to return only those contexts selected for that A/B test. The system transforms the snapshot data back into the respective services’ Java objects (POJOs) so that the feature encoders operate on the exact same POJOs for both offline experimentation and online feature generation in production.
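A hypothetical sketch of that flow in PySpark, with made-up keyspace, table, and column names (the actual API differs):

```python
from cassandra.cluster import Cluster
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
session = Cluster(["cassandra.example.com"]).connect("snapshots")  # placeholder host/keyspace

def viewing_history_snapshot(destination_time, ab_test_id=None):
    """Resolve the S3 location of the Viewing History snapshot taken at or
    before destination_time, load it into Spark, and optionally keep only the
    contexts selected for a given A/B test."""
    row = session.execute(
        "SELECT s3_path FROM snapshot_index "
        "WHERE service = %s AND snapshot_time <= %s "
        "ORDER BY snapshot_time DESC LIMIT 1",
        ("viewing_history", destination_time)).one()
    snapshot = spark.read.parquet(row.s3_path)
    if ab_test_id is not None:
        snapshot = snapshot.where(snapshot.experiment_id == ab_test_id)
    return snapshot
```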

      The following diagram shows the overall architecture of the time machine and where Spark is used in building it: from selecting members for experimentation, snapshotting data of various services for the selected members, to finally serving the data for a time in the past.


      DeLorean: Generating Features via Time Travel

      DeLorean is our internal project to build the system that takes an experiment plan, travels back in time to collect all the necessary data from the snapshots, and generates a dataset of features and labels for that time in the past to train machine learning models. Of course, the first step is to select the destination time, to bring it up to 88 miles per hour, then DeLorean takes care of the rest.

      Running an Experiment

      DeLorean allows a researcher to run a wide range of experiments by automatically determining how to launch the time machine, what time coordinates are needed, what data to retrieve, and how to structure the output. Thus, to run a new experiment, an experimenter only needs to provide the following:
      • Label data: A blueprint for obtaining a set of contexts with associated time coordinates, items, and labels for each. This is typically created by a Hive, Pig, or Spark SQL query
      • A feature model containing the required feature encoder configurations
      • Implementations of any new feature encoders that do not already exist in our library
      DeLorean provides a capability for writing and modifying a new feature encoder during an experiment, for example, in a Zeppelin Notebook or in Spark Shell, so that it can be used immediately for feature generation. If we find that new feature encoder useful, we can later productionize it by adding it to our library of feature encoders.

      The high-level process to generate features is depicted in the following diagram, where the blocks highlighted in light green are typically customized for new experiments. In this scenario, experimenters can also implement new feature encoders that are used in conjunction with existing ones.
      DeLorean image by JMortonPhoto.com & OtoGodfrey.com

      Label Data and Feature Encoders

      One of the primary inputs to DeLorean is the label data, which contains information about the contexts, items, and associated labels for which to generate features. The contexts, as the name suggests, describe the setting in which a model is to be used (e.g. tuples of member profiles, country, time, device, etc.). Items are the elements which are to be trained on, scored, and/or ranked (e.g. videos, rows, search entities). Labels are typically the targets used in supervised learning for each context-item combination. For unsupervised learning approaches, the label is not required. As an example, for personalized ranking the context could be defined as the member profile ID, country code, and time, whereas the items are the videos, and the labels are plays or non-plays. In this example, the label data is created by joining the set of snapshotted contexts to the logged play actions.

      Once we have this label dataset, we need to compute features for each context-item combination in the dataset by using the desired set of feature encoders. Each feature encoder takes a context and each of the target items associated with the context, together with some raw data elements in the form of POJOs, to compute one or more features.

      Each type of item, context variable or data element has a data key associated with it. Every feature encoder has a method that returns the set of keys for the data it consumes. DeLorean uses these keys to identify the required data types, retrieves the data, and passes it to the feature encoder as a data map, which is a map from data keys to data objects.
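A hypothetical Python sketch of what such an encoder contract could look like (the key names and the example encoder are made up; the real encoders operate on the services' Java POJOs):

```python
from abc import ABC, abstractmethod

class FeatureEncoder(ABC):
    """A feature encoder declares the data keys it needs and turns a
    (context, item, data map) triple into one or more feature values."""

    @abstractmethod
    def required_data_keys(self):
        """Set of data keys this encoder consumes."""

    @abstractmethod
    def encode(self, context, item, data_map):
        """Compute the feature(s) for one context-item pair."""

class MinutesWatchedInItemGenre(FeatureEncoder):
    def required_data_keys(self):
        return {"VIEWING_HISTORY", "VIDEO_METADATA"}

    def encode(self, context, item, data_map):
        history = data_map["VIEWING_HISTORY"]    # snapshot data offline, live service online
        metadata = data_map["VIDEO_METADATA"]
        genre = metadata[item]["genre"]
        return sum(play["minutes"] for play in history
                   if metadata[play["video_id"]]["genre"] == genre)
```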

      We made DeLorean flexible enough to allow the experiments to use different types of contexts and items without needing to customize the feature generation system. DeLorean can be used not only for recommendations, but also for a row ordering experiment which has profile-device tuple as context and rows of videos as items. Another use case may be a search experiment which has the query-profile-country tuple as context and individual videos as items. To achieve this, DeLorean automatically infers the type of contexts and items from the label data and the data keys required by the feature encoders.

      Data Elements

      Data elements are the ingredients that get transformed into features by a feature encoder. Some of these are context-dependent, such as viewing history for a profile, and others are shared by all contexts, such as metadata of the videos. We handle these two types of data elements differently.

      For context-dependent data elements, we use the snapshots described above, and associate each one with a data key. We bring all the required snapshot data sources together with the values, items, and labels for each context, so that the data for a single context is sent to a single Spark executor. Different contexts are broken up to enable distributed feature generation. The snapshots are loaded as an RDD of (context, Map(data key -> data element)) in a lazy fashion and a series of joins between the label data and all the necessary context-dependent data elements are performed using Spark.

      For context-independent data elements, DeLorean broadcasts these bulk data elements to each executor. Since these data elements have manageable sizes and often have a slow rate of change over time, we keep a record of each update that we use to rewind back to the appropriate previous version. These are kept in memory as singleton objects and made available to the feature generators for each context processed by an executor. Thus, a complete data map is created for each context containing the context data, context-dependent snapshot data elements, and shared data singletons.
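A miniature PySpark sketch of the two patterns just described, using toy data: context-dependent snapshots are joined to the label data per context, while shared data elements are broadcast to every executor:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

# Toy label data and snapshots, keyed by context id.
label_data = sc.parallelize([(1, {"items": [10, 11], "labels": [1, 0]}),
                             (2, {"items": [12], "labels": [1]})])
snapshots = sc.parallelize([(1, {"VIEWING_HISTORY": [{"video_id": 10, "minutes": 42}]})])

# Context-dependent data elements are joined onto the label data per context.
per_context = label_data.leftOuterJoin(snapshots)

# Context-independent data elements (e.g. video metadata) are broadcast once.
video_metadata = sc.broadcast({10: {"genre": "animation"},
                               11: {"genre": "drama"},
                               12: {"genre": "documentary"}})

def generate_features(record):
    context_id, (labels, snapshot) = record
    data_map = dict(snapshot or {})
    data_map["VIDEO_METADATA"] = video_metadata.value
    # A real implementation would invoke the configured feature encoders here.
    genres = [data_map["VIDEO_METADATA"][i]["genre"] for i in labels["items"]]
    return context_id, genres

print(per_context.map(generate_features).collect())
```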

      Once the features are generated in Spark, the data is represented as a Spark DataFrame with an embedded schema. For many personalization applications, we need to rank a number of items for each context. To avoid shuffling in the ranking process, item features are grouped by context in the output. The final features are stored in Hive using the Parquet format.

      Model Training, Validation, and Testing

      We use features generated using our time machine to train the models that we use in various parts of our recommendation systems. We use a standardized schema for passing the DataFrames of training features to machine learning algorithms, as well as computing predictions and metrics for trained models on the validation and test feature DataFrames. We also standardized a format to serialize the models that we use for publishing the models to be later consumed by online applications or in other future experiments.

      The following diagram shows how we run a typical machine learning experiment. Once the experiment is designed, we collect the dataset of contexts, items, and labels. Next the features for the label dataset are generated. We then train models using either single machine, multi-core, or distributed algorithms and perform parameter tuning by computing metrics on a validation set. Then we pick the best models and compare them on a testing set. When we see a significant improvement in the offline metrics over the production model and that the outputs are different enough, we design an A/B test using variations of the model and run it online. If the A/B test shows a statistically significant increase in core metrics, we roll it out broadly. Otherwise, we learn from the results to iterate on the next idea. 

      Going Online

      One of the primary motivations for building DeLorean is to share the same feature encoders between offline experiments and online scoring systems to ensure that there are no discrepancies between the features generated for training and those computed online in production. When an idea is ready to be tested online, the model is packaged with the same feature configuration that was used by DeLorean to generate the features.

      To compute features in the production system, we directly call our online microservices to collect the data elements required by all the feature encoders used in a model, instead of obtaining them from snapshots as we do offline. We then assemble them into data maps and pass them to the feature encoders. The feature vector is then passed to the offline-trained model for computing predictions, which are used to create our recommendations. The following diagram shows the high-level process of transitioning from an offline experiment to an online production system where the blocks highlighted in yellow are online systems, and the ones highlighted in blue are offline systems. Note that the feature encoders are shared between online and offline to guarantee the consistency of feature generation.

      Conclusion and Future work

      By collecting the state of the online world at a point in time for a select set of contexts, we were able to build a mechanism for turning back time. Spark’s distributed, resilient computation power enabled us to snapshot millions of contexts per day and to implement feature generation, model training and validation at scale. DeLorean is now being used in production for feature generation in some of the latest A/B tests for our recommender system.

      However, this is just a start and there are many ways in which we can improve this approach. Instead of batch snapshotting on a periodic cadence, we can drive the snapshots based on events, for example at a time when a particular member visits our service. To avoid duplicate data collection, we can also capture data changes instead of taking full snapshots each time. We also plan on using the time machine capability for other needs in evaluating new algorithms and testing our systems. Of course, we leave the ability to travel forward in time as future work.

      Fast experimentation is the hallmark of a culture of innovation. Reducing the time to production for an idea is a key metric we use to measure the success of our infrastructure projects. We will continue to build on this foundation to bring better personalization to Netflix in our effort to delight members and win moments of truth. If you are interested in these types of time-bending engineering challenges, join us.

      Evolution of the Netflix Data Pipeline

      Our new Keystone data pipeline went live in December of 2015. In this article, we talk about the evolution of Netflix’s data pipeline over the years. This is the first of a series of articles about the new Keystone data pipeline.

      Netflix is a data-driven company. Many business and product decisions are based on insights derived from data analysis. The charter of the data pipeline is to collect, aggregate, process and move data at cloud scale. Almost every application at Netflix uses the data pipeline.

      Here are some statistics about our data pipeline:
• ~500 billion events and ~1.3 PB per day
• ~8 million events and ~24 GB per second during peak hours

      There are several hundred event streams flowing through the pipeline. For example:
      • Video viewing activities
      • UI activities
      • Error logs
      • Performance events
      • Troubleshooting & diagnostic events

Note that operational metrics don’t flow through this data pipeline. We have a separate telemetry system, Atlas, which we open-sourced like many other Netflix technologies.

Over the last few years, our data pipeline has experienced major transformations due to evolving requirements and technological developments.

      V1.0 Chukwa pipeline

      The sole purpose of the original data pipeline was to aggregate and upload events to Hadoop/Hive for batch processing. As you can see, the architecture is rather simple. Chukwa collects events and writes them to S3 in Hadoop sequence file format. The Big Data Platform team further processes those S3 files and writes to Hive in Parquet format. End-to-end latency is up to 10 minutes. That is sufficient for batch jobs which usually scan data at daily or hourly frequency.
[Diagram: V1.0 Chukwa Pipeline]

      V1.5 Chukwa pipeline with real-time branch

      With the emergence of Kafka and Elasticsearch over the last couple of years, there has been a growing demand for real-time analytics in Netflix. By real-time, we mean sub-minute latency.
[Diagram: V1.5 Chukwa Pipeline with Real-Time Branch]
In addition to uploading events to S3/EMR, Chukwa can also tee traffic to Kafka (the front gate of the real-time branch). In V1.5, approximately 30% of the events are branched to the real-time pipeline. The centerpiece of the real-time branch is the router. It is responsible for routing data from Kafka to the various sinks: Elasticsearch or secondary Kafka.

We have seen explosive growth in Elasticsearch adoption within Netflix over the last two years. There are ~150 clusters totaling ~3,500 instances hosting ~1.3 PB of data. The vast majority of the data is ingested via our data pipeline.

      When Chukwa tees traffic to Kafka, it can deliver full or filtered streams. Sometimes, we need to apply further filtering on the Kafka streams written from Chukwa. That is why we have the router to consume from one Kafka topic and produce to a different Kafka topic.
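As a rough sketch of this consume-filter-produce pattern, here is a minimal routing loop written against today's Kafka consumer/producer client APIs (not the high-level consumer discussed below). The topic names, broker address, and filter predicate are made up for illustration; the actual router is a managed service, and was later reimplemented on Samza.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RouterSketch {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "kafka:9092");
        consumerProps.put("group.id", "router-sketch");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "kafka:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("events-full")); // full stream written from Chukwa
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Keep only the subset of events the downstream sink cares about.
                    if (record.value().contains("\"type\":\"playback_error\"")) {
                        producer.send(new ProducerRecord<>("events-errors", record.key(), record.value()));
                    }
                }
            }
        }
    }
}
```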

      Once we deliver data to Kafka, it empowers users with real-time stream processing: Mantis, Spark, or custom applications. “Freedom and Responsibility” is the DNA of Netflix culture. It’s up to users to choose the right tool for the task at hand.

      Because moving data at scale is our expertise, our team maintains the router as a managed service. But there are a few lessons we learned while operating the routing service:
• The Kafka high-level consumer can lose partition ownership and stop consuming some partitions after running stably for a while. This requires us to bounce the processes.
      • When we push out new code, sometimes the high-level consumer can get stuck in a bad state during rebalance.
• We group hundreds of routing jobs into a dozen clusters. The operational overhead of managing those jobs and clusters is an increasing burden. We need a better platform to manage the routing jobs.

      V2.0 Keystone pipeline (Kafka fronted)

In addition to the issues related to the routing service, there are other motivations for us to revamp our data pipeline:
      • Simplify the architecture.
      • Kafka implements replication that improves durability, while Chukwa doesn’t support replication.
      • Kafka has a vibrant community with strong momentum.

      V2.0 Keystone Pipeline (1).png

      There are three major components:

      • Data Ingestion - There are two ways for applications to ingest data.
  • Use our Java library and write to Kafka directly.
  • Send to an HTTP proxy, which then writes to Kafka.
      • Data Buffering - Kafka serves as the replicated persistent message queue. It also helps absorb temporary outages from downstream sinks.
      • Data Routing - The routing service is responsible for moving data from fronting Kafka to various sinks: S3, Elasticsearch, and secondary Kafka.

We have been running the Keystone pipeline in production for the past few months. We are still evolving Keystone with a focus on QoS, scalability, availability, operability, and self-service.

      In follow-up posts, we’ll cover more details regarding:
• How do we run Kafka in the cloud at scale?
• How do we implement the routing service using Samza?
• How do we manage and deploy Docker containers for the routing service?

      If building large-scale infrastructure excites you, we are hiring!

      Real-Time Data Infrastructure Team

      Recommending for the World

      #AlgorithmsEverywhere




      The Netflix experience is driven by a number of Machine Learning algorithms: personalized ranking, page generation, search, similarity, ratings, etc. On the 6th of January, we simultaneously launched Netflix in 130 new countries around the world, which brings the total to over 190 countries. Preparing for such a rapid expansion while ensuring each algorithm was ready to work seamlessly created new challenges for our recommendation and search teams. In this post, we highlight the four most interesting challenges we’ve encountered in making our algorithms operate globally and, most importantly, how this improved our ability to connect members worldwide with stories they'll love.


      Challenge 1: Uneven Video Availability


      Before we can add a video to our streaming catalog on Netflix, we need to obtain a license for it from the content owner. Most content licenses are region-specific or country-specific and are often held to terms for years at a time. Ultimately, our goal is to let members around the world enjoy all our content through global licensing, but currently our catalog varies between countries. For example, the dystopian Sci-Fi movie “Equilibrium” might be available on Netflix in the US but not in France. And “The Matrix” might be available in France but not in the US. Our recommendation models rely heavily on learning patterns from play data, particularly involving co-occurrence or sequences of plays between videos. In particular, many algorithms assume that when something was not played it is a (weak) signal that someone may not like a video, because they chose not to play it. However, in this particular scenario we will never observe any members who played both “Equilibrium” and “The Matrix”. A basic recommendation model would then learn that these two movies do not appeal to the same kinds of people just because the audiences were constrained to be different. However, if these two movies were available to the same set of members, we would likely observe a similarity between the videos and between the members who watch them. From this example, it is clear that uneven video availability potentially interferes with the quality of our recommendations.


      Our search experience faces a similar challenge. Given a (partial) query from a member, we want to present the most relevant videos in the catalog. However, not accounting for availability differences reduces the quality of this ranking. For example, the top results for a given query from a ranking algorithm unaware of availability differences could include a niche video followed by a well-known one in a case where the latter is only available to a relatively small number of our global members and the former is available much more broadly.


      Another aspect of content licenses is that they have start and end dates, which means that a similar problem arises not only across countries, but also within a given country across time. If we compare a well-known video that has only been available on Netflix for a single day to another niche video that was available for six months, we might conclude that the latter is a lot more engaging. However, if the recently added, well-known video had instead been on the site for six months, it probably would have more total engagement.


Given that these issues introduce a bias in something as simple as popularity, one can imagine the impact they can have on more sophisticated search or recommendation models. Addressing the issue of uneven availability across both geography and time lets our algorithms provide better recommendations for a video already on our service when it becomes available in a new country.


      So how can we avoid learning catalog differences and focus on our real goal of learning great recommendations for our members? We incorporate into each algorithm the information that members have access to different catalogs based on geography and time, for example by building upon concepts from the statistical community on handling missing data.
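As a toy illustration of treating unavailability as missing data (this is not Netflix's actual model, just a sketch of the idea), co-occurrence statistics can be restricted to members for whom both titles were actually available, so that a catalog gap is never mistaken for a taste signal:

```java
import java.util.Map;
import java.util.Set;

// Illustrative only: count co-plays of two videos while treating "not available"
// as missing data rather than an implicit negative. A member contributes to the
// denominator only if both videos were in their catalog during the observation window.
public class AvailabilityAwareCooccurrence {
    public static double coPlayRate(Set<Long> membersWhoPlayedA, Set<Long> membersWhoPlayedB,
                                    Map<Long, Set<String>> memberCatalog,
                                    String videoA, String videoB) {
        long eligible = 0;
        long coPlays = 0;
        for (Map.Entry<Long, Set<String>> entry : memberCatalog.entrySet()) {
            Set<String> catalog = entry.getValue();
            if (!catalog.contains(videoA) || !catalog.contains(videoB)) {
                continue; // catalog difference, not a taste signal
            }
            eligible++;
            if (membersWhoPlayedA.contains(entry.getKey()) && membersWhoPlayedB.contains(entry.getKey())) {
                coPlays++;
            }
        }
        return eligible == 0 ? 0.0 : (double) coPlays / eligible;
    }
}
```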


      Challenge 2: Cultural Awareness


Another key challenge in making our algorithms work well around the world is to ensure that we can capture local variations in taste. We know that even with the same catalog worldwide we would not expect a video to have the exact same popularity across countries. For example, we expect that Bollywood movies would have a different popularity in India than in Argentina. But should two members with similar profiles get similar recommendations if one lives in India and the other in Argentina? Perhaps if they are both watching a lot of Sci-Fi, their recommendations should be similar. Meanwhile, overall we would expect Argentine members to be recommended more Argentine Cinema and Indian members more Bollywood.


      An obvious approach to capture local preferences would be to build models for individual countries. However, some countries are small and we will have very little member data available there. Training a recommendation algorithm on such sparse data leads to noisy results, as the model will struggle to identify clear personalization patterns from the data. So we need a better way.


      Prior to our global expansion, our approach was to group countries into regions of a reasonable size that had a relatively consistent catalog and language. We would then build individual models for each region. This could capture the taste differences between regions because we trained separate models whose hyperparameters were tuned differently. Within a region, as long as there were enough members with certain taste preference and a reasonable amount of history, a recommendation model should be able to identify and use that pattern of taste. However, there were several problems with this approach. The first is that within a region the amount of data from a large country would dominate the model and dampen its ability to learn the local tastes for a country with a smaller number of members. It also presented a challenge of how to maintain the groupings as catalogs changed over time and memberships grew. Finally, because we’re continuously running A/B tests with model variants across many algorithms, the combinatorics involving a growing number of regions became overwhelming.


      To address these challenges we sought to combine the regional models into a single global model that also improves the recommendations we make, especially in countries where we may not yet have many members. Of course, even though we are combining the data, we still need to reflect local differences in taste. This leads to the question: is local taste or personal taste more dominant? Based on the data we’ve seen so far, both aspects are important, but it is clear that taste patterns do travel globally. Intuitively, this makes sense: if a member likes Sci-Fi movies, someone on the other side of the world who also likes Sci-Fi would be a better source for recommendations than their next-door neighbor who likes food documentaries. Being able to discover worldwide communities of interest means that we can further improve our recommendations, especially for niche interests, as they will be based on more data. Then with a global algorithm we can identify new or different taste patterns that emerge over time.


      To refine our models we can use many signals about the content and about our members. In this global context, two important taste signals could be language and location. We want to make our models aware of not just where someone is logged in from but also aspects of a video such as where it is from, what language it is in, and where it is popular. Going back to our example, this information would let us offer different recommendations to a brand new member in India as compared to Argentina, as the distribution of tastes within the two countries is different. We expand on the importance of language in the next section.


      Challenge 3: Language


Netflix has now grown to support 21 languages and our catalog includes more local content than ever. This increase creates a number of challenges, especially for the instant search algorithm mentioned above. The key objective of this algorithm is to help every member find something to play whenever they search while minimizing the number of interactions. This is different from standard ranking metrics used to evaluate information retrieval systems, which do not take the amount of interaction into account. When looking at interactions, it is clear that different languages involve very different interaction patterns. For example, Korean is usually typed using the Hangul alphabet, where syllables are composed from individual characters. To search for “올드보이” (Oldboy), in the worst possible case a member would have to enter nine characters: “ㅇ ㅗ ㄹ ㄷ ㅡ ㅂ ㅗ ㅇ ㅣ”. Using a basic indexing for the video title, in the best case a member would still need to type three characters: “ㅇ ㅗ ㄹ”, which would be collapsed into the first syllable of that title: “올”. In a Hangul-specific indexing, a member would need to type as little as one character: “ㅇ”. Optimizing for the best results with the minimum set of interactions and automatically adapting to newly introduced languages with significantly different writing systems is an area we’re working on improving.
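As a small illustration of why Hangul-specific indexing helps, the leading consonant (choseong) of each precomposed syllable can be recovered with simple Unicode arithmetic, so a one-jamo query like “ㅇ” can already match “올드보이”. This sketch is illustrative only and is not Netflix's search implementation:

```java
// Illustrative only: extract each Hangul syllable's leading consonant (choseong)
// so that a one-jamo query such as "ㅇ" can match a title such as "올드보이".
public class HangulIndexSketch {
    // Compatibility jamo for the 19 possible leading consonants, in choseong order.
    private static final char[] CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ".toCharArray();

    public static String leadingJamoKey(String title) {
        StringBuilder key = new StringBuilder();
        for (char c : title.toCharArray()) {
            if (c >= 0xAC00 && c <= 0xD7A3) {              // precomposed Hangul syllable block
                int syllableIndex = c - 0xAC00;
                key.append(CHOSEONG[syllableIndex / 588]); // 588 = 21 vowels * 28 finals
            } else {
                key.append(c);                             // pass non-Hangul characters through
            }
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // "올드보이" -> "ㅇㄷㅂㅇ", so the query "ㅇ" is already a prefix match.
        System.out.println(leadingJamoKey("올드보이"));
    }
}
```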


      Another language-related challenge relates to recommendations. As mentioned above, while taste patterns travel globally, ultimately people are most likely to enjoy content presented in a language they understand. For example, we may have a great French Sci-Fi movie on the service, but if there are no subtitles or audio available in English we wouldn’t want to recommend it to a member who likes Sci-Fi movies but only speaks English. Alternatively, if the member speaks both English and French, then there is a good chance it would be an appropriate recommendation. People also often have preferences for watching content that was originally produced in their native language, or one they are fluent in. While we constantly try to add new language subtitles and dubs to our content, we do not yet have all languages available for all content. Furthermore, different people and cultures also have different preferences for watching with subtitles or dubs. Putting this together, it seems clear that recommendations could be better with an awareness of language preferences. However, currently which languages a member understands and to what degree is not defined explicitly, so we need to infer it from ancillary data and viewing patterns.


      Challenge 4: Tracking Quality


The objective is to build recommendation algorithms that work equally well for all of our members, no matter where they live or what language they speak. But with so many members in so many countries speaking so many languages, a challenge we now face is figuring out when an algorithm is sub-optimal for some subset of our members.


To handle this, we could use some of the approaches for the challenges above. For example, we could look at the performance of our algorithms by manually slicing along a set of dimensions (country, language, catalog, …). However, some of these slices lead to very sparse and noisy data. At the other end of the scale, we could look only at metrics observed globally, but this would dramatically limit our ability to detect issues until they impact a large number of our members. One approach to this problem is to learn how to best group observations for the purpose of automatically detecting outliers and anomalies, as sketched below. Just as we work on improving our recommendation algorithms, we are innovating on our metrics, instrumentation, and monitoring to improve their fidelity and, through them, our ability to detect new problems and highlight areas to improve our service.
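A generic sketch of per-slice anomaly detection, far simpler than production monitoring and intended only to show the shape of the idea: compute a metric per slice (country, language, catalog, …) and flag slices that deviate strongly from the population.

```java
import java.util.HashMap;
import java.util.Map;

// Generic illustration: flag slices whose metric deviates from the population by more
// than a z-score threshold. Sparse slices would need weighting or pooling in practice.
public class SliceAnomalySketch {
    public static Map<String, Boolean> flagOutliers(Map<String, Double> metricBySlice, double zThreshold) {
        double mean = metricBySlice.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = metricBySlice.values().stream()
                .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0.0);
        double stdDev = Math.sqrt(variance);

        Map<String, Boolean> flagged = new HashMap<>();
        for (Map.Entry<String, Double> entry : metricBySlice.entrySet()) {
            boolean outlier = stdDev > 0 && Math.abs(entry.getValue() - mean) / stdDev > zThreshold;
            flagged.put(entry.getKey(), outlier);
        }
        return flagged;
    }
}
```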


      Conclusion

To support a launch of this magnitude, we examined each and every algorithm that is part of our service and began to address these challenges. Along the way, we found not just approaches that will make Netflix better for those signing up in the 130 new countries, but in fact better for all Netflix members worldwide. For example, solving the first and second challenges lets us discover worldwide communities of interest so that we can make better recommendations. Solving the third challenge means that regardless of where our members are based, they can use Netflix in the language that suits them best and quickly find the content they’re looking for. Solving the fourth challenge means that we’re able to detect issues at a finer grain, so that our recommendation and search algorithms help all our members find content they love. Of course, our global journey is just beginning and we look forward to making our service dramatically better over time. If you are an algorithmic explorer who finds this type of adventure exciting, take a look at our current job openings.

      Caching for a Global Netflix


      #CachesEverywhere

      Netflix members have come to expect a great user experience when interacting with our service. There are many things that go into delivering a customer-focused user experience for a streaming service, including an outstanding content library, an intuitive user interface, relevant and personalized recommendations, and a fast service that quickly gets your favorite content playing at very high quality, to name a few.

      The Netflix service heavily embraces a microservice architecture that emphasizes separation of concerns. We deploy hundreds of microservices, with each focused on doing one thing well. This allows our teams and the software systems they produce to be highly aligned while being loosely coupled. Many of these services are stateless, which makes it easier to (auto)scale them. They often achieve the stateless loose coupling by maintaining state in caches or persistent stores.

EVCache is an extensively used data-caching service that provides the low-latency, high-reliability caching solution that the Netflix microservice architecture demands.

      It is a RAM store based on memcached, optimized for cloud use. EVCache typically operates in contexts where consistency is not a strong requirement. Over the last few years, EVCache has been scaled to significant traffic while providing a robust key-value interface. At peak, our production EVCache deployments routinely handle upwards of 30 million requests/sec, storing hundreds of billions of objects across tens of thousands of memcached instances. This translates to just under 2 trillion requests per day globally across all EVCache clusters.


Earlier this year, Netflix launched globally in 130 additional countries, making it available in nearly every country in the world. In this blog post we talk about how we built EVCache’s global replication system to meet Netflix’s growing needs. EVCache is open source, and has been in production for more than 5 years. To read more about EVCache, check out one of our early blog posts.

      Motivation


      Netflix’s global, cloud-based service is spread across three Amazon Web Services (AWS) regions: Northern Virginia, Oregon, and Ireland. Requests are mostly served from the region the member is closest to. But network traffic can shift around for various reasons, including problems with critical infrastructure or region failover exercises (“Chaos Kong”). As a result, we have adopted a stateless application server architecture which lets us serve any member request from any region.

      The hidden requirement in this design is that the data or state needed to serve a request is readily available anywhere. High-reliability databases and high-performance caches are fundamental to supporting our distributed architecture.  One use case for a cache is to front a database or other persistent store. Replicating such caches globally helps with the “thundering herd” scenario: without global replication, member traffic shifting from one region to another would encounter “cold” caches for those members in the new region. Processing the cache misses would lengthen response times and overwhelm the databases.

      Another major use case for caching is to “memoize” data which is expensive to recompute, and which doesn’t come from a persistent store. When the compute systems write this kind of data to a local cache, the data has to be replicated to all regions so it’s available to serve member requests no matter where they originate. The bottom line is that microservices rely on caches for fast, reliable access to multiple types of data like a member’s viewing history, ratings, and personalized recommendations. Changes and updates to cached data need to be replicated around the world to enable fast, reliable, and global access.

EVCache was designed with these use cases in mind. When we embarked upon the global replication system design for EVCache, we also considered non-requirements. One non-requirement is strong global consistency. It’s okay, for example, if Ireland and Virginia occasionally have slightly different recommendations for you as long as the difference doesn’t hurt your browsing or streaming experience. For non-critical data, we rely heavily on this “eventual consistency” model for replication, where local or global differences are tolerated for a short time. This simplifies the EVCache replication design tremendously: it doesn’t need to deal with global locking, quorum reads and writes, transactional updates, partial-commit rollbacks, or other complications of distributed consistency.

      We also wanted to make sure the replication system wouldn’t affect the performance and reliability of local cache operations, even if cross-region replication slowed down. All replication is asynchronous, and the replication system can become latent or fail temporarily without affecting local cache operations.

Replication latency is another loose requirement. How fast is fast enough? How often does member traffic switch between regions, and what is the impact of inconsistency? Rather than demand the impossible from a replication system (“instantaneous and perfect”), what Netflix needs from EVCache is acceptable latency while tolerating some inconsistency - as long as both are low enough to serve the needs of our applications and members.

      Cross-Region Replication Architecture


      EVCache replicates data both within a region and globally. The intra-region redundancy comes from a simultaneous write to all server groups within the region. For cross-region replication, the key components are shown in the diagram below.

[Diagram: cross-region replication flow for a SET operation]

      This diagram shows the replication steps for a SET operation. An application calls set() on the EVCache client library, and from there the replication path is transparent to the caller.

      1. The EVCache client library sends the SET to the local region’s instance of the cache
      2. The client library also writes metadata (including the key, but not the data) to the replication message queue (Kafka)
      3. The “Replication Relay” service in the local region reads messages from this queue
      4. The Relay fetches the data for the key from the local cache
      5. The Relay sends a SET request to the remote region's “Replication Proxy” service
      6. In the remote region, the Replication Proxy receives the request and performs a SET to its local cache, completing the replication
      7. Local applications in the receiving region will now see the updated value in the local cache when they do a GET

      This is a simplified picture, of course. For one thing, it refers only to SET - not other operations like DELETE, TOUCH, or batch mutations. The flows for DELETE and TOUCH are very similar, with some modifications: they don’t have to read the existing value from the local cache, for example.

      It's important to note that the only part of the system that reaches across region boundaries is the message sent from the Replication Relay to the Replication Proxy (step 5). Clients of EVCache are not aware of other regions or of cross-region replication; reads and writes use only the local, in-region cache instances.

      Component Responsibilities

      Replication Message Queue

      The message queue is the cornerstone of the replication system. We use Kafka for this. The Kafka stream for a fully-replicated cache has two consumers: one Replication Relay cluster for each destination region. By having separate clusters for each target region, we de-couple the two replication paths and isolate them from each other’s latency or other issues.

      If a target region goes wildly latent or completely blows up for an extended period, the buffer for the Kafka queue will eventually fill up and Kafka will start dropping older messages. In a disaster scenario like this, the dropped messages are never sent to the target region. Netflix services which use replicated caches are designed to tolerate such occasional disruptions.

      Replication Relay

      The Replication Relay cluster consumes messages from the Kafka cluster. Using a secure connection to the Replication Proxy cluster in the destination region, it writes the replication request (complete with data fetched from the local cache, if needed) and awaits a success response. It retries requests which encounter timeouts or failures.

      Temporary periods of high cross-region latency are handled gracefully: Kafka continues to accept replication messages and buffers the backlog when there are delays in the replication processing chain.
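A hypothetical sketch of the Relay's inner loop (steps 3 through 5 in the flow above) is shown below. LocalCacheClient and RemoteProxyClient are stand-ins for the real EVCache client and the calls to the Replication Proxy; they are not actual Netflix APIs, and retry handling is reduced to a comment.

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical sketch of the Replication Relay: consume key/metadata from Kafka,
// fetch the payload from the local cache, and forward it to the remote region's Proxy.
public class ReplicationRelaySketch {
    interface LocalCacheClient { byte[] get(String key); }
    interface RemoteProxyClient {
        void set(String key, byte[] value) throws Exception;
        void delete(String key) throws Exception;
    }

    void run(KafkaConsumer<String, String> metadataConsumer,
             LocalCacheClient localCache, RemoteProxyClient remoteProxy) {
        metadataConsumer.subscribe(Collections.singletonList("evcache-replication-metadata"));
        while (true) {
            for (ConsumerRecord<String, String> msg : metadataConsumer.poll(Duration.ofMillis(500))) {
                String key = msg.key();   // only the key and metadata travel through Kafka
                String op = msg.value();  // e.g. "SET", "DELETE", "TOUCH"
                try {
                    if ("SET".equals(op)) {
                        byte[] data = localCache.get(key);  // fetch the payload from the local cache
                        if (data != null) {
                            remoteProxy.set(key, data);     // forward to the remote region's Proxy
                        } else {
                            remoteProxy.delete(key);        // value already gone; invalidate remotely
                        }
                    } else {
                        remoteProxy.delete(key);            // DELETE/invalidate paths need no payload
                    }
                } catch (Exception e) {
                    // In the real system, timeouts and failures are retried.
                }
            }
        }
    }
}
```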

      Replication Proxy

      The Replication Proxy cluster for a cache runs in the target region for replication. It receives replication requests from the Replication Relay clusters in other regions and synchronously writes the data to the cache in its local region. It then returns a response to the Relay clusters, so they know the replication was successful.

      When the Replication Proxy writes to its local region’s cache, it uses the same open-source EVCache client that any other application would use. The common client library handles all the complexities of sharding and instance selection, retries, and in-region replication to multiple cache servers.

      As with many Netflix services, the Replication Relay and Replication Proxy clusters have multiple instances spread across Availability Zones (AZs) in each region to handle high traffic rates while being resilient against localized failures.

      Design Rationale and Implications


      The Replication Relay and Replication Proxy services, and the Kafka queue they use, all run separately from the applications that use caches and from the cache instances themselves. All the replication components can be scaled up or down as needed to handle the replication load, and they are largely decoupled from local cache read and write activity. Our traffic varies on a daily basis because of member watching patterns, so these clusters scale up and down all the time. If there is a surge of activity, or if some kind of network slowdown occurs in the replication path, the queue might develop a backlog until the scaling occurs, but latency of local cache GET/SET operations for applications won’t be affected.

      As noted above, the replication messages on the queue contain just the key and some metadata, not the actual data being written. We get various efficiency wins this way. The major win is a smaller, faster Kafka deployment which doesn’t have to be scaled to hold all the data that exists in the caches. Storing large data payloads in Kafka would make it a costly bottleneck, due to storage and network requirements. Instead, the Replication Relay fetches the data from the local cache, with no need for another copy in Kafka.

      Another win we get from writing just the metadata is that sometimes, we don’t need the data for replication at all. For some caches, a SET on a given key only needs to invalidate that key in the other regions - we don’t send the new data, we just send a DELETE for the key. In such cases, a subsequent GET in the other region results in a cache miss (rather than seeing the old data), and the application will handle it like any other miss. This is a win when the rate of cross-region traffic isn’t high - that is, when there are few GETs in region A for data that was written from region B. Handling these occasional misses is cheaper than constantly replicating the data.

      Optimizations

      We have to balance latency and throughput based on the requirements of each cache. The 99th percentile of end-to-end replication latency for most of our caches is under one second. Some of that time comes from a delay to allow for buffering: we try to batch up messages at various points in the replication flow to improve throughput at the cost of a bit of latency. The 99th percentile of latency for our highest-volume replicated cache is only about 400ms because the buffers fill and flush quickly.

      Another significant optimization is the use of persistent connections. We found that the latency improved greatly and was more stable after we started using persistent connections between the Relay and Proxy clusters. It eliminates the need to wait for the 3-way handshake to establish a new TCP connection and also saves the extra network time needed to establish the TLS/SSL session before sending the actual replication request.

We improved throughput and lowered the overall communication latency between the Relay cluster and Proxy cluster by batching multiple messages in a single request to fill a TCP window. Ideally the batch size would vary to match the TCP window size, which can change over the life of the connection. In practice we tune the batch size empirically for good throughput. While this batching can add latency, it allows us to get more out of each TCP packet and reduces the number of connections we need to set up on each instance, thus letting us use fewer instances for a given replication demand profile.
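A toy sketch of this batching trade-off: buffer messages and flush on either a byte budget (roughly a TCP window's worth) or a small time budget. The budgets shown are arbitrary placeholders, not our tuned values, and a real implementation would also flush from a background timer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of batching: accumulate replication messages and flush when a byte
// budget or a small time budget is reached, trading a little latency for throughput.
public class BatchingSketch {
    private static final int MAX_BATCH_BYTES = 64 * 1024; // illustrative byte budget
    private static final long MAX_BATCH_AGE_MILLIS = 50;  // illustrative latency budget

    private final List<byte[]> buffer = new ArrayList<>();
    private int bufferedBytes = 0;
    private long lastFlushMillis = System.currentTimeMillis();

    public synchronized void add(byte[] message) {
        buffer.add(message);
        bufferedBytes += message.length;
        if (bufferedBytes >= MAX_BATCH_BYTES
                || System.currentTimeMillis() - lastFlushMillis >= MAX_BATCH_AGE_MILLIS) {
            flush();
        }
    }

    private void flush() {
        if (buffer.isEmpty()) return;
        sendBatch(new ArrayList<>(buffer)); // one request to the Proxy for the whole batch
        buffer.clear();
        bufferedBytes = 0;
        lastFlushMillis = System.currentTimeMillis();
    }

    private void sendBatch(List<byte[]> batch) {
        // Placeholder: in the real system this is a single request over a persistent TLS connection.
    }
}
```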

      With these optimizations we have been able to scale EVCache’s cross-region replication system to routinely handle over a million RPS at peak daily.


      Challenges and Learnings

      The current version of our Kafka-based replication system has been in production for over a year and replicates more than 1.5 million messages per second at peak. We’ve had some growing pains during that time. We’ve seen periods of increased end-to-end latencies, sometimes with obvious causes like a problem with the Proxy application’s autoscaling rules, and sometimes without - due to congestion on the cross-region link on the public Internet, for example.

      Before using VPC at Amazon, one of our biggest problems was the implicit packets-per-second limits on our AWS instances. Cross that limit, and the AWS instance experiences a high rate of TCP timeouts and dropped packets, resulting in high replication latencies, TCP retries, and failed replication requests which need to be retried later. The solution is simple: scale out. Using more instances means there is more total packets-per-second capacity. Sometimes two “large” instances are a better choice than a single “extra large,” even when the costs are the same. Moving into VPC significantly raised some limits, like packets per second, while also giving us access to other enhanced networking capabilities which allow the Relay and Proxy clusters to do more work per instance.

      In order to be able to diagnose which link in the chain is causing latency, we introduced a number of metrics to track and monitor the latencies at different points in the system: from the client application to Kafka, in the Relay cluster’s reading from Kafka, from the Relay cluster to the remote Proxy cluster, and from Proxy cluster to its local cache servers. There are also end-to-end timing metrics to track how well the system is doing overall.


      At this point, we have a few main issues that we are still working through:

      • Kafka does not scale up and down conveniently. When a cache needs more replication-queue capacity, we have to manually add partitions and configure the consumers with matching thread counts and scale the Relay cluster to match. This can lead to duplicate/re-sent messages, which is inefficient and may cause more than the usual level of eventual consistency skew.
      • If we lose an EVCache instance in the remote region, this results in an increase in latency as the Proxy cluster tries and fails to write to the missing instance. This latency leads back to the Relay side, which is awaiting confirmation for each (batched) replication request. We’ve worked to reduce the time spent in this state: we detect the lost instance earlier, and we are investigating reconciliation mechanisms to minimize the impact of these situations. We have made changes in the EVCache client that allow the Proxy instances to cope more easily with the possibility that cache instances can disappear.
      • Kafka monitoring, particularly for missing messages, is not an exact science. Software bugs can cause messages not to appear in the Kafka partition, or not to be received by our Relay cluster. We monitor by comparing the total number of messages received by our Kafka brokers (on a per topic basis) and the number of messages replicated by the Relay cluster. If there is more than a small acceptable threshold of difference for any significant time, we investigate. We also monitor maximum latencies (not the average), because the processing of one partition may be significantly slower for some reason. That situation requires investigation even if the average is acceptable. We are still improving these and other alerts to better detect real issues with fewer false-positives.

      Future

      We still have a lot of work to do on the replication system. Future improvements might involve pipelining replication messages on a single connection for better and more efficient connection use, optimizations to take better advantage of the network TCP window size, or transitioning to the new Kafka 0.9 API. We hope to make our Relay clusters (the Kafka consumers) autoscale cleanly without significantly increasing latencies or increasing the number of duplicate/re-sent messages.

EVCache is one of the critical components of Netflix's distributed architecture, providing globally replicated data at RAM speed so any member can be served from anywhere. In this post we covered how we took on the challenge of providing reliable and fast replication for caching systems at a global scale. We look forward to improving more as our needs evolve and as our global member base expands. As a company, we strive to win more of our members’ moments of truth, and our team helps in that mission by building highly available distributed caching systems at scale. If this is something you’d enjoy too, reach out to us - we’re hiring!




      IMF: A Prescription for Versionitis


      This blog post provides an introduction to the emerging IMF (Interoperable Master Format) standard from SMPTE (The Society of Motion Picture and Television Engineers),  and delves into a short case study that highlights some of the operational benefits that Netflix receives from IMF today.

      Have you ever noticed that your favorite movie or TV show feels a little different depending on whether you’re watching it on Netflix, on DVD, on an airplane or from your local cable provider? One reason could be that you’re watching a slightly different edit. In addition to changes for specific distribution channels (like theatrical widescreen, HD home video, airline edits, etc.), content owners typically need to create new versions of their movie or television show for distribution in different territories.

      Netflix licenses the majority of its content from other owners, sometimes years after the original assets were created, and often for multiple territories. This leads to a number of problems, including receiving cropped or pan-and-scanned versions of films. We also frequently run into problems when we try to sync dubbed audio and/or subtitles. For example, a film shot and premiered theatrically at 24 frames per second (fps), may be converted to 29.97fps and/or re-cut for a specific distribution channel. Alternate language assets (like audio and timed text) are then created to match the derivative version.

      In order to preserve the artist’s creative intent, Netflix always requests content in its original format (native aspect ratio, frame rate, etc.). In the case of a film, we would receive a 24fps theatrical version of the video, but the dubbed audio and subtitles won’t necessarily match, as they may have been created from the 29.97fps version, or even another version that was re-cut for international distribution. We’ve coined the term “Versionitis” to describe this asset-management malady.

      Luckily, the good folks over at SMPTE (whom you may know from ubiquitous standards like countdown leader, timecode and color bars, among others) have been hard at work, capitalizing on some of the successes of digital cinema, to design a better system of component-ized file-based workflows with a solution to versioning right in its DNA. If not a cure for versionitis, we’re hoping that IMF will at least provide some relief from this pernicious condition.

      The Interoperable Master Format

      The advance of technology within the motion picture post-production industry has effected a paradigm shift, moving the industry from tape-based to file-based workflows. The need for a standardized set of specifications for the file-based workflow has given birth to the Interoperable Master Format (IMF). IMF is a file-based framework designed to facilitate the management and processing of multiple content versions (airline edits, special editions, alternate languages, etc.) of the same high-quality finished work (feature, episode, trailer, advertisement, etc.) destined for distribution channels worldwide. The key concepts underlying IMF include:
      • Facilitating an internal or business-to-business relationship. IMF is not intended to be delivered to the consumer directly.
      • While IMF is intended to be a specification for the Distribution Service Master, it could be used as an archival master as well.
• Support for audio and video, as well as data essence in the form of subtitles, captions, etc.
• Support for descriptive and dynamic metadata (the latter can vary as a function of time) that is expected to be synchronized to an essence.
• Wrapping (encapsulating) of media essence, data essence, and dynamic metadata into well-understood temporal units, called track files, using the MXF (Material eXchange Format) file specification.
      • Each content version is embodied in a Composition, which combines metadata and essences. An example of a composition might be the US theatrical cut or an airline edit.
      • A Composition Playlist (CPL) defines the playback timeline for the Composition and includes metadata applicable to the Composition as a whole via XML.
      • IMF allows for the creation of many different distribution formats from the same composition. This can be accomplished by specifying the processing/transcoding instructions through an Output Profile List (OPL).

      The Composition Playlist

      The IMF Composition Playlist (CPL) XML defines the playback timeline for the Composition and includes metadata applicable to the Composition as a whole. The CPL is not designed to contain essence but rather reference external Track Files that contain the actual essence. This construct allows multiple compositions to be managed and processed without duplicating common essence files. The IMF CPL is constrained to contain exactly one video track.


[Diagram: Composition Playlist timeline with Segments, Sequences, and Resources referencing track files]


      The timeline of the CPL (light blue in example) contains multiple Segments designed to play sequentially. Each Segment (dark grey), in turn, contains multiple Sequences (e.g., an image sequence and an audio sequence, beige), that play in parallel. Each sequence is composed of multiple Resources (green and red for image and audio essence respectively) that refer to physical track files, and subsequently, the audio and video samples / frames that comprise the overall composition. In the example above, light grey portions of the track files represent essence samples that are not relevant to this composition.

      The flexible CPL mechanism decouples the playback timeline from the underlying track files, allowing for economical and incremental updates to the timeline when necessary. Each CPL is associated with a universally unique identifier (UUID) that can be used to track versioning of the playback timeline. Likewise, resources within the CPL reference essence data via each track file’s UUID.
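As a simplified mental model of this structure, the timeline can be thought of as nested value objects that reference track files by UUID. The actual CPL is an XML document defined by SMPTE; the types and field names below are illustrative only.

```java
import java.util.List;
import java.util.UUID;

// Simplified, illustrative data model of the CPL structure described above.
public class CplModelSketch {
    // A Resource points into a physical track file (by UUID) rather than carrying essence.
    record Resource(UUID trackFileId, long entryPoint, long duration) {}

    // A Sequence (e.g. an image sequence or an audio sequence) is a list of Resources
    // that play back to back.
    record Sequence(String kind, List<Resource> resources) {}

    // A Segment groups Sequences that play in parallel.
    record Segment(List<Sequence> sequences) {}

    // The Composition Playlist: a UUID-identified timeline of Segments that play sequentially.
    record CompositionPlaylist(UUID id, List<Segment> segments) {}
}
```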

      Composition Playlist for Supply Chain Automation

The core IMF principles help realize a better asset management system. In order to achieve a higher degree of ingest automation for Netflix’s Digital Supply Chain, additional information needs to be associated with an IMF delivery and meaningful constraints need to be applied to the IMF CPL. Examples of additional information include metadata that associates the viewable timeline with the release title, regions and territories where the timeline can be viewed, and content maturity ratings. The IMF Composition Playlist defines optional constructs that can carry such information, enabling tighter integration with the business systems of various players in the entertainment industry ecosystem.

      Anatomy of an IMP

      Asset delivery and playback timeline aspects are decoupled in IMF. The unit of delivery between two businesses is called an Interoperable Master Package (IMP). An IMP can be described as follows:
      1. An Interoperable Master Package (IMP) shall consist of one Packing List (PKL - an XML file that describes a list of files), and all the files it references
      2. An IMP (equivalently, the PKL) can contain one or more complete or incomplete Compositions
      3. A Complete IMP is an IMP containing the complete set of assets comprising one or more Compositions. Mathematically, a complete IMP is such that all of the asset references of all of the CPLs described in the PKL are also contained in the PKL
4. A Partial IMP is an IMP containing one or more incomplete Compositions. In other words, some assets needed to complete the composition are not present in the package, i.e., some of the assets referred to by a CPL are not contained in the PKL. Depending upon the order in which IMPs arrive into a content ingestion system, the dangling references associated with a partial IMP may be resolved using assets that came with previously ingested IMPs, or may be resolved in the future as more IMPs are ingested.

In relation to the example above, the indicated composition could be delivered as a single, complete IMP. In this case, the IMP would contain the CPL file with UUID1, image essence track files with UUID6, UUID7 and UUID8 respectively, and audio essence track files with UUID11 and UUID12 respectively.

      The same composition could also be delivered as multiple partial IMPs. One such scenario could comprise an IMP1 containing CPL file with UUID1 and one audio essence track file with UUID11, and an IMP2 containing image essence track files with UUID6, UUID7 and UUID8 respectively and the audio essence track file with UUID12.
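The complete-versus-partial distinction boils down to a set-containment check over UUIDs. The sketch below uses illustrative types rather than an actual IMF toolkit API:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Sketch of the complete-vs-partial IMP distinction: an IMP is complete when every
// asset UUID referenced by its CPLs is also listed in its Packing List (PKL).
public class ImpCompletenessSketch {
    public static boolean isCompleteImp(Set<UUID> assetsInPkl, List<Set<UUID>> assetRefsPerCpl) {
        Set<UUID> referenced = new HashSet<>();
        assetRefsPerCpl.forEach(referenced::addAll);
        return assetsInPkl.containsAll(referenced);
    }

    // Dangling references of a partial IMP: assets a CPL needs that this PKL does not carry.
    public static Set<UUID> danglingReferences(Set<UUID> assetsInPkl, List<Set<UUID>> assetRefsPerCpl) {
        Set<UUID> dangling = new HashSet<>();
        assetRefsPerCpl.forEach(dangling::addAll);
        dangling.removeAll(assetsInPkl);
        return dangling;
    }
}
```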

      Case Study - House of Cards Season 3

Netflix started ingesting Interoperable Master Packages in 2014, when we started receiving Breaking Bad 4K masters. Initial support was limited to complete IMPs (as defined above), with constrained CPLs that only referenced one ImageSequence and up to two AudioSequences, each contained in its own track file. CPLs referencing multiple track files, with timeline offsets, were not supported, so these early IMPs are very similar to a traditional muxed audio / video file.

      In February of 2015, shortly before the House of Cards Season 3 release date, the Netflix ident (the animated Netflix logo that precedes and follows a Netflix Original) was given the gift of sound.

[Image: the Netflix ident]

Unfortunately, all episodes of House of Cards had already been mastered and ingested with the original video-only ident, as had all of the alternative language subtitles and dubbed audio tracks. House of Cards has marked a number of critical milestones for Netflix, and it was important to us to launch season 3 with the new ident. In the pre-IMF days, addressing this problem would have been an expensive, operationally demanding, and very manual process, requiring re-QC of all of our masters and language assets (dubbed audio and subtitles) for every episode. With IMF, it was a relatively simple exercise in versioning and component-ized delivery.

      Rather than requiring an entirely new master package, the addition of ident audio to each episode required only new per-episode CPLs. These new CPLs were identical to the old, but referenced a different set of audio track files for the first ~100 frames and the last ~100 frames. Because this did not change the overall duration of the timeline, and did not adjust the timing of any other audio or video resources, there was no danger of other, already encoded, synchronized assets (like dubbed audio or subtitles) falling out-of-sync as a result of the change.

      To Be Continued …

      Next in this series, we will describe our IMF ingest implementation and how it fits into our content processing pipeline.


      By Rohit Puri, Andy Schuler and Sreeram Chakrovorthy

      How We Build Code at Netflix

      How does Netflix build code before it’s deployed to the cloud? While pieces of this story have been told in the past, we decided it was time we shared more details. In this post, we describe the tools and techniques used to go from source code to a deployed service serving movies and TV shows to more than 75 million global Netflix members.
The above diagram expands on a previous post announcing Spinnaker, our global continuous delivery platform. There are a number of steps that need to happen before a line of code makes its way into Spinnaker:
      • Code is built and tested locally using Nebula
      • Changes are committed to a central git repository
      • A Jenkins job executes Nebula, which builds, tests, and packages the application for deployment
      • Builds are “baked” into Amazon Machine Images
      • Spinnaker pipelines are used to deploy and promote the code change
      The rest of this post will explore the tools and processes used at each of these stages, as well as why we took this approach. We will close by sharing some of the challenges we are actively addressing. You can expect this to be the first of many posts detailing the tools and challenges of building and deploying code at Netflix.

      Culture, Cloud, and Microservices

      Before we dive into how we build code at Netflix, it’s important to highlight a few key elements that drive and shape the solutions we use: our culture, the cloud, and microservices.
      The Netflix culture of freedom and responsibility empowers engineers to craft solutions using whatever tools they feel are best suited to the task. In our experience, for a tool to be widely accepted, it must be compelling, add tremendous value, and reduce the overall cognitive load for the majority of Netflix engineers. Teams have the freedom to implement alternative solutions, but they also take on additional responsibility for maintaining these solutions. Tools offered by centralized teams at Netflix are considered to be part of a “paved road”. Our focus today is solely on the paved road supported by Engineering Tools.
      In addition, in 2008 Netflix began migrating our streaming service to AWS and converting our monolithic, datacenter-based Java application to cloud-based Java microservices. Our microservice architecture allows teams at Netflix to be loosely coupled, building and pushing changes at a speed they are comfortable with.

      Build

      Naturally, the first step to deploying an application or service is building. We created Nebula, an opinionated set of plugins for the Gradle build system, to help with the heavy lifting around building applications. Gradle provides first-class support for building, testing, and packaging Java applications, which covers the majority of our code. Gradle was chosen because it was easy to write testable plugins, while reducing the size of a project's build file. Nebula extends the robust build automation functionality provided by Gradle with a suite of open source plugins for dependency management, release management, packaging, and much more.
      A simple Java application build.gradle file.
      The above ‘build.gradle’ file represents the build definition for a simple Java application at Netflix. This project’s build declares a few Java dependencies as well as applying 4 Gradle plugins, 3 of which are either a part of Nebula or are internal configurations applied to Nebula plugins. The ‘nebula’ plugin is an internal-only Gradle plugin that provides convention and configuration necessary for integration with our infrastructure. The ‘nebula.dependency-lock’ plugin allows the project to generate a .lock file of the resolved dependency graph that can be versioned, enabling build repeatability. The ‘netflix.ospackage-tomcat’ plugin and the ospackage block will be touched on below.
      With Nebula, we provide reusable and consistent build functionality, with the goal of reducing boilerplate in each application’s build file. A future techblog post will dive deeper into Nebula and the various features we’ve open sourced. For now, you can check out the Nebula website.

      Integrate

      Once a line of code has been built and tested locally using Nebula, it is ready for continuous integration and deployment. The first step is to push the updated source code to a git repository. Teams are free to find a git workflow that works for them.
Once the change is committed, a Jenkins job is triggered. Our use of Jenkins for continuous integration has evolved over the years. We started with a single massive Jenkins master in our datacenter and have evolved to running 25 Jenkins masters in AWS. Jenkins is used throughout Netflix for a variety of automation tasks beyond simple continuous integration.
A Jenkins job is configured to invoke Nebula to build, test and package the application code. If the repository being built is a library, Nebula will publish the .jar to our artifact repository. If the repository is an application, then the Nebula ospackage plugin will be executed. Using the Nebula ospackage (short for “operating system package”) plugin, an application’s build artifact will be bundled into either a Debian or RPM package, whose contents are defined via a simple Gradle-based DSL. Nebula will then publish the Debian file to a package repository where it will be available for the next stage of the process, “baking”.

      Bake

      Our deployment strategy is centered around the Immutable Server pattern. Live modification of instances is strongly discouraged in order to reduce configuration drift and ensure deployments are repeatable from source. Every deployment at Netflix begins with the creation of a new Amazon Machine Image, or AMI. To generate AMIs from source, we created “the Bakery”.
The Bakery exposes an API that facilitates the creation of AMIs globally. The Bakery API service then schedules the actual bake job on worker nodes that use Aminator to create the image. To trigger a bake, the user declares the package to be installed, as well as the foundation image onto which the package is installed. That foundation image, or Base AMI, provides a Linux environment customized with the common conventions, tools, and services required for seamless integration with the greater Netflix ecosystem.
      When a Jenkins job is successful, it typically triggers a Spinnaker pipeline. Spinnaker pipelines can be triggered by a Jenkins job or by a git commit. Spinnaker will read the operating system package generated by Nebula, and call the Bakery API to trigger a bake.

      Deploy

      Once a bake is complete, Spinnaker makes the resultant AMI available for deployment to tens, hundreds, or thousands of instances. The same AMI is usable across multiple environments as Spinnaker exposes a runtime context to the instance which allows applications to self-configure at runtime.  A successful bake will trigger the next stage of the Spinnaker pipeline, a deploy to the test environment. From here, teams will typically exercise the deployment using a battery of automated integration tests. The specifics of an application’s deployment pipeline becomes fairly custom from this point on. Teams will use Spinnaker to manage multi-region deployments, canary releases, red/black deployments and much more. Suffice to say that Spinnaker pipelines provide teams with immense flexibility to control how they deploy code.

      The Road Ahead

      Taken together, these tools enable a high degree of efficiency and automation. For example, it takes just 16 minutes to move our cloud resiliency and maintenance service, Janitor Monkey, from code check-in to a multi-region deployment.
      A Spinnaker bake and deploy pipeline triggered from Jenkins.
That said, we are always looking to improve the developer experience and are constantly challenging ourselves to do it better, faster, and easier.
      One challenge we are actively addressing is how we manage binary dependencies at Netflix. Nebula provides tools focused on making Java dependency management easier. For instance, the Nebula dependency-lock plugin allows applications to resolve their complete binary dependency graph and produce a .lock file which can be versioned. The Nebula resolution rules plugin allows us to publish organization-wide dependency rules that impact all Nebula builds. These tools help make binary dependency management easier, but still fall short of reducing the pain to an acceptable level.
      Another challenge we are working to address is bake time. It wasn’t long ago that 16-minutes from commit to deployment was a dream, but as other parts of the system have gotten faster, this now feels like an impediment to rapid innovation. From the Simian Army example deployment above, the bake process took 7 minutes or 44% of the total bake and deploy time. We have found the biggest drivers of bake time to be installing packages (including dependency resolution) and the AWS snapshot process itself.
      As Netflix grows and evolves, there is an increasing demand for our build and deploy toolset to provide first-class support for non-JVM languages, like JavaScript/Node.js, Python, Ruby and Go. Our current recommendation for non-JVM applications is to use the Nebula ospackage plugin to produce a Debian package for baking, leaving the build and test pieces to the engineers and the platform’s preferred tooling. While this solves the needs of teams today, we are expanding our tools to be language agnostic.
      Containers provide an interesting potential solution to the last two challenges and we are exploring how containers can help improve our current build, bake, and deploy experience. If we can provide a local container-based environment that closely mimics that of our cloud environments, we potentially reduce the amount of baking required during the development and test cycles, improving developer productivity and accelerating the overall development process. A container that can be deployed locally just as it would be in production without modification reduces cognitive load and allows our engineers to focus on solving problems and innovating rather than trying to determine if a bug is due to environmental differences.
      You can expect future posts providing updates on how we are addressing these challenges. If these challenges sound exciting to you, come join the Engineering Tools team. You can check out our open jobs and apply today!


      Stream-processing with Mantis

      Back in January of 2014 we wrote about the need for better visibility into our complex operational environments.  The core of the message in that post was about the need for fine-grained, contextual and scalable insights into the experiences of our customers and behaviors of our services.  While our execution has evolved somewhat differently from our original vision, the underlying principles behind that vision are as relevant today as they were then.  In this post we’ll share what we’ve learned building Mantis, a stream-processing service platform that’s processing event streams of up to 8 million events per second and running hundreds of stream-processing jobs around the clock.  We’ll describe the architecture of the platform and how we’re using it to solve real-world operational problems.

      Why Mantis?

      There are more than 75 million Netflix members watching 125 million hours of content every day in over 190 countries around the world.  To provide an incredible experience for our members, it’s critical for us to understand our systems at both the coarse-grained service level and fine-grained device level.  We’re good at detecting, mitigating, and resolving issues at the application service level - and we’ve got some excellent tools for service-level monitoring - but when you get down to the level of individual devices, titles, and users, identifying and diagnosing issues gets more challenging.

      We created Mantis to make it easy for teams to get access to realtime events and build applications on top of them.  We named it after the Mantis shrimp, a freakish yet awesome creature that is both incredibly powerful and fast.  The Mantis shrimp has sixteen photoreceptors in its eyes compared to humans’ three.  It has one of the most unique visual systems of any creature on the planet.  Like the shrimp, the Mantis stream-processing platform is all about speed, power, and incredible visibility.  

So Mantis is a platform for building low-latency, high-throughput stream-processing apps, but why do we need it?  It’s been said that the Netflix microservices architecture is a metrics generator that occasionally streams movies.  It’s a joke, of course, but there’s an element of truth to it; our systems do produce billions of events and metrics on a daily basis.  Paradoxically, we often experience the problem of having both too much data and too little at the same time.  Situations invariably arise in which you have thousands of metrics at your disposal but none are quite what you need to understand what’s really happening.  There are some cases where you do have access to relevant metrics, but the granularity isn’t quite good enough for you to understand and diagnose the problem you’re trying to solve.  And there are still other scenarios where you have all the metrics you need, but the signal-to-noise ratio is so low that the problem is virtually impossible to diagnose.  Mantis enables us to build highly granular, realtime insights applications that give us deep visibility into the interactions between Netflix devices and our AWS services.  It helps us better understand the long tail of problems where some users, on some devices, in some countries are having problems using Netflix.

      By making it easier to get visibility into interactions at the device level, Mantis helps us “see” details that other metrics systems can’t.  It’s the difference between 3 photoreceptors and 16.

      A Deeper Dive

With Mantis, we wanted to abstract developers away from the operational overhead associated with managing their own cluster of machines.  Mantis was built from the ground up to be cloud native.  It manages a cluster of EC2 servers that is used to run stream-processing jobs.  Apache Mesos is used to abstract the cluster into a shared pool of computing resources.  We built, and open-sourced, a custom scheduling library called Fenzo to intelligently allocate these resources among jobs.

      Architecture Overview

The Mantis platform comprises a master and an agent cluster.  Users submit stream-processing applications as jobs that run as one or more workers on the agent cluster.  The master consists of a Resource Manager that uses Fenzo to optimally assign resources to a job's workers.  A Job Manager embodies the operational behavior of a job, including metadata, SLAs, artifact locations, job topology, and life cycle.

      The following image illustrates the high-level architecture of the system.

      Mantis Jobs

Mantis provides a flexible model for defining a stream-processing job. A Mantis job can be defined as single-stage for basic transformation/aggregation use cases or multi-stage for sharding and processing high-volume, high-cardinality event streams.

      There are three main parts to a Mantis job. 
      • The source is responsible for fetching data from an external source
• One or more processing stages, which are responsible for processing incoming event streams using higher-order RxJava functions
      • The sink to collect and output the processed data
RxNetty provides non-blocking access to the event stream for a job and is used to move data between its stages; a simplified sketch of these three pieces follows.
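As a rough structural sketch only (these interfaces are illustrative stand-ins, not the actual Mantis job API), the three pieces can be thought of as functions over RxJava Observables:

// Illustrative shapes only, not the real Mantis API: a job wires a source, one or
// more processing stages, and a sink around RxJava Observables, with RxNetty
// carrying the streams between stages over the network.
interface Source<T> {
    rx.Observable<T> events();                       // fetch data from an external source
}

interface Stage<T, R> {
    rx.Observable<R> process(rx.Observable<T> in);   // transform or aggregate the event stream
}

interface Sink<R> {
    void write(rx.Observable<R> out);                // collect and output the processed data
}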


      To give you a better idea of how a job is structured, let's take a look at a typical ‘aggregate by group’ example.



Imagine that we are trying to process logs sent by devices to calculate error rates per device type.  The job is composed of three stages. The first stage is responsible for fetching events from a device log source job and grouping them based on device ID. The grouped events are then routed to workers in stage 2 such that all events for the same group (i.e., device ID) will get routed to the same worker.  Stage 2 is where stateful computations like windowing and reducing - e.g., calculating the error rate over a 30-second rolling window - are performed.  Finally, the aggregated results for each device ID are collected by Stage 3 and made available for dashboards or other applications to consume.
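Setting aside the Mantis-specific job wiring, the grouping and windowing logic of stage 2 can be sketched with plain RxJava 1.x operators. The LogEvent type and the input observable below are hypothetical stand-ins for the real device log source, and tumbling 30-second windows are used for brevity where the text describes rolling windows.

import java.util.concurrent.TimeUnit;
import rx.Observable;

class ErrorRateStageSketch {
    // Hypothetical event type standing in for the device log events emitted by the source job.
    static class LogEvent {
        final String deviceId;
        final boolean isError;
        LogEvent(String deviceId, boolean isError) { this.deviceId = deviceId; this.isError = isError; }
    }

    // Per-device error rate over 30-second windows, expressed with plain RxJava operators.
    static Observable<String> errorRates(Observable<LogEvent> deviceLogs) {
        return deviceLogs
            .groupBy(e -> e.deviceId)                          // route events by device ID
            .flatMap(group -> group
                .window(30, TimeUnit.SECONDS)                  // 30-second windows per device
                .flatMap(win -> win.reduce(new int[]{0, 0}, (acc, e) -> {
                    acc[0] += e.isError ? 1 : 0;               // error count
                    acc[1] += 1;                               // total count
                    return acc;
                }))
                .map(acc -> group.getKey() + " errorRate="
                        + (acc[1] == 0 ? 0.0 : (double) acc[0] / acc[1])));
    }
}

In the real job, stage 1 performs the grouping and routes each group to a stage 2 worker, so the window-and-reduce step runs with all events for a given device on the same worker.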

      Job Chaining

      One of the unique features of Mantis is the ability to chain jobs together.  Job chaining allows for efficient data and code reuse.  The image below shows an example of an anomaly detector application composed of several jobs chained together.  The anomaly detector streams data from a job that serves Zuul request/response events (filtered using a simple SQL-like query) along with output from a “Top N” job that aggregates data from several other source jobs.


      Scaling in Action

At Netflix the amount of data that needs to be processed varies widely based on the time of day.  Running at peak capacity all the time is expensive and unnecessary. Mantis autoscales both the cluster size and the individual jobs as needed.

      The following chart shows how Fenzo autoscales the Mesos worker cluster by adding and removing EC2 instances in response to demand over the course of a week.

[Chart: Fenzo autoscaling the Mesos worker cluster over one week]

      And the chart below shows an individual job’s autoscaling in action, with additional workers being added or removed based on demand over a week.

[Chart: an individual job autoscaling its worker count over one week]

      UI for Self-service, API for Integration

      Mantis sports a dedicated UI and API for configuring and managing jobs across AWS regions.  Having both a UI and API improves the flexibility of the platform.  The UI gives users the ability to quickly and manually interact with jobs and platform functionality while the API enables easy programmatic integration with automated workflows.

      The jobs view in the UI, shown below, lets users quickly see which jobs are running across AWS regions along with how many resources the jobs are consuming.


      Each job instance is launched as part of a job cluster, which you can think of as a class definition or template for a Mantis job.  The job cluster view shown in the image below provides access to configuration data along with a view of running jobs launched from the cluster config. From this view, users are able to update cluster configurations and submit new job instances to run.

      How Mantis Helps Us

      Now that we’ve taken a quick look at the overall architecture for Mantis, let’s turn our attention to how we’re using it to improve our production operations.  Mantis jobs currently process events from about 20 different data sources including services like Zuul, API, Personalization, Playback, and Device Logging to name a few.

      Of the growing set of applications built on these data sources, one of the most exciting use cases we’ve explored involves alerting on individual video titles across countries and devices.

      One of the challenges of running a large-scale, global Internet service is finding anomalies in high-volume, high-cardinality data in realtime.  For example, we may need access to fine-grained insights to figure out if there are playback issues with House of Cards, Season 4, Episode 1 on iPads in Brazil.  To do this we have to track millions of unique combinations of data (what we call assets) all the time, a use case right in Mantis’ wheelhouse.

Let’s consider this use case in more detail.  The rate of events for a title asset (title * devices * country) shows a lot of variation, so a popular title on a popular device can have orders of magnitude more events than lower-usage title and device combinations.  Additionally, for each asset there is high variability in event rate based on the time of day.  To detect anomalies, we track rolling windows of unique events per asset.  The size of the window and the alert thresholds vary dynamically based on the rate of events.  When the percentage of anomalous events exceeds the threshold, we generate an alert for our playback and content platform engineering teams.  This approach has allowed us to quickly identify and correct problems that would previously have gone unnoticed or, at best, been caught by manual testing or reported via customer service.
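As a rough illustration of the alerting logic described above (not the actual Mantis job code; the class name is ours, and the window size and threshold are hypothetical constants where the real system derives them dynamically from the event rate), a per-asset tracker might look like this:

import java.util.ArrayDeque;
import java.util.Deque;

class AssetAnomalyTracker {
    private final Deque<Boolean> window = new ArrayDeque<>();   // true = anomalous event
    private final int windowSize;                               // in practice sized from the event rate
    private final double threshold;                             // in practice adjusted dynamically as well

    AssetAnomalyTracker(int windowSize, double threshold) {
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    // Record one event; return true when the anomalous fraction of a full window exceeds the threshold.
    boolean record(boolean anomalous) {
        window.addLast(anomalous);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        long bad = window.stream().filter(b -> b).count();
        return window.size() == windowSize && (double) bad / window.size() > threshold;
    }
}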

      Below is a screen from an application for viewing playback stats and alerts on video titles. It surfaces data that helps engineers find the root cause for errors.


In addition to alerting at the individual title level, we can also do realtime alerting on our key performance indicator: SPS (stream starts per second).  The advantage of Mantis alerting for SPS is that it gives us the ability to ratchet down our time to detect (TTD) from around 8 minutes to less than 1 minute.  Faster TTD gives us a chance to resolve issues faster (time to recover, or TTR), which helps us win more moments of truth as members use Netflix around the world.

      Where are we going?

      We’re just scratching the surface of what’s possible with realtime applications, and we’re exploring ways to help more teams harness the power of stream-processing.  For example, we’re working on improving our outlier detection system by integrating Mantis data sources, and we’re working on usability improvements to get teams up and running more quickly using self-service tools provided in the UI.

      Mantis has opened up insights capabilities that we couldn’t easily achieve with other technologies and we’re excited to see stream-processing evolve as an important and complementary tool in our operational and insights toolset at Netflix.  

      If the work described here sounds exciting to you, head over to our jobs page; we’re looking for great engineers to join us on our quest to reinvent TV! 

      by Ben Schmaus, Chris Carey, Neeraj Joshi, Nick Mahilani, and Sharma Podila





      Extracting image metadata at scale


      We have a collection of nearly two million images that play very prominent roles in helping members pick what to watch. This blog describes how we use computer vision algorithms to address the challenges of focal point, text placement and image clustering at a large scale.


      Focal point
All images have a region that is the most interesting part (e.g. a character’s face, the sharpest region, etc.). In order to effectively render an image on a variety of canvases like a phone screen or TV, it is often necessary to display only the interesting region of the image and dynamically crop the rest of the image depending on the available real estate and desired user experience. The goal of the focal point algorithm is to use a series of signals to identify the most interesting region of an image, then use that information to dynamically display it.
      [Examples of face and full-body features to determine the focal point of the image]


We first try to identify all the people and their body positioning using Haar-cascade-based features. We also built Haar-based features to identify whether an image is a close-up, upper-body, or full-body shot of the person(s). With this information, we were able to build an algorithm that auto-selects what is considered the "best" or "most interesting" person and then focuses in on that specific location.
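For illustration, a bare-bones version of the person-detection step could use the OpenCV Java bindings with a stock Haar cascade model. The file names and the "largest face wins" heuristic below are simplifying assumptions; the production features and selection logic are more elaborate.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfRect;
import org.opencv.core.Rect;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;
import org.opencv.objdetect.CascadeClassifier;

public class FocalPointSketch {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);                 // load the OpenCV native library

        Mat image = Imgcodecs.imread("artwork.jpg");                  // hypothetical input artwork
        CascadeClassifier faceModel =
            new CascadeClassifier("haarcascade_frontalface_default.xml");  // stock OpenCV cascade

        Mat gray = new Mat();
        Imgproc.cvtColor(image, gray, Imgproc.COLOR_BGR2GRAY);        // Haar cascades operate on grayscale

        MatOfRect detections = new MatOfRect();
        faceModel.detectMultiScale(gray, detections);                 // detect candidate faces

        // Naive heuristic: treat the largest detected face as the focal point and crop around it.
        Rect focal = null;
        for (Rect r : detections.toArray()) {
            if (focal == null || r.area() > focal.area()) {
                focal = r;
            }
        }
        if (focal != null) {
            Imgcodecs.imwrite("artwork_focal.jpg", image.submat(focal));
        }
    }
}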


However, not all images have humans in them. So, to identify interesting regions in those cases, we created a different signal: edges. We heuristically identify the focus of an image by first applying a Gaussian blur and then calculating the edges of the image.


      Here is one example of applying such a transformation:



#include <opencv2/imgproc/imgproc.hpp>
using namespace cv;

/// Edge-focus heuristic: blur, convert to grayscale, then apply the Laplacian
void edgeFocus( const Mat& input, Mat& abs_dst, int n = 3 )
{
    Mat src = input.clone(), src_gray, dst;
    int kernel_size = 3, scale = 1, delta = 0, ddepth = CV_16S;

    /// Remove noise by blurring with a Gaussian filter
    GaussianBlur( src, src, Size(n, n), 0, 0, BORDER_DEFAULT );
    /// Convert the image to grayscale
    cvtColor( src, src_gray, CV_BGR2GRAY );

    /// Apply the Laplace function to highlight edges
    Laplacian( src_gray, dst, ddepth, kernel_size, scale, delta, BORDER_CONSTANT );
    convertScaleAbs( dst, abs_dst );
}

      Below are a few examples of dynamically cropped images based on focal point for different canvases:




      Text Placement
Another interesting challenge is determining the best place to put text on an image. Examples of this are the ‘New Episode’ badge and the placement of subtitles in a video frame.


      [Example of “New Episode” badge hiding the title of the show]


      In both cases, we’d like to avoid placing new text on top of existing text on these images.


Using a text detection algorithm allows us to automatically detect and correct such cases. However, text detection algorithms produce many false positives, so we apply several transformations like watershed and thresholding before running text detection. With such transformations, we can get a fairly accurate probability of text being present in a region of interest for any image in a large corpus of images.
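As one example of such a pre-processing pass, here is a minimal sketch of a simple Otsu threshold via the OpenCV Java bindings; the watershed step and the text detector itself are omitted, and the helper name is ours.

import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

class TextDetectionPreprocess {
    // Binarize an image before handing it to a text detector (assumes the OpenCV
    // native library has already been loaded).
    static Mat binarizeForTextDetection(Mat bgrImage) {
        Mat gray = new Mat();
        Imgproc.cvtColor(bgrImage, gray, Imgproc.COLOR_BGR2GRAY);
        Mat binary = new Mat();
        // Otsu's method picks the threshold automatically; high-contrast text tends to
        // survive as clean connected components for the downstream detector to score.
        Imgproc.threshold(gray, binary, 0, 255, Imgproc.THRESH_BINARY | Imgproc.THRESH_OTSU);
        return binary;
    }
}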




      [Results of text detection on some of the transformations of the same image]

      Image Clustering
Images play an important role in a member’s decision to watch a particular video. We constantly test various flavors of artwork for different titles to decide which one performs the best. In order to learn which image is more effective globally, we would like to see how an image performs in each region. To get an overall view of how well a particular set of visually similar images performed globally, we need to group them together based on their visual similarity.


      We have several derivatives of the same image to display for different users. Although visually similar, not all of these images come from the same source. These images have varying degrees of image cropping, resizing, color correction and title treatment to serve a global audience.


As a global company that is constantly testing and experimenting with imagery, we have a collection of millions of images that we are continuously shifting and evolving. Manually grouping and maintaining these images can be expensive and time consuming, so we wanted to create a process that was smarter and more efficient.


      [An example of two images with slight color correction, cropping and localized title treatment]


These images are often transformed and color corrected, so a traditional color-histogram-based comparison does not always work for such automated grouping. Therefore, we came up with an algorithm that uses the following combination of parameters to determine a similarity index - a measurement of visual similarity among a group of images.


We calculate the similarity index based on the following four parameters:
      1. Histogram based distance
      2. Structural similarity between two images
      3. Feature matching between two images
      4. Earth mover’s distance algorithm to measure overall color similarity


Using all four methods, we can compute a numerical similarity value between two images in a relatively fast comparison.
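To make the first of those four signals concrete, here is a minimal sketch of a histogram-based comparison using the OpenCV Java bindings (assuming OpenCV 3.x constants). The other three signals and the weighting that combines them into the final similarity index are omitted, and the helper names are ours.

import java.util.Arrays;
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfFloat;
import org.opencv.core.MatOfInt;
import org.opencv.imgproc.Imgproc;

class HistogramSimilaritySketch {
    // Correlation between hue histograms: close to 1.0 for near-identical color distributions.
    static double histogramSimilarity(Mat imageA, Mat imageB) {
        return Imgproc.compareHist(hueHistogram(imageA), hueHistogram(imageB), Imgproc.HISTCMP_CORREL);
    }

    // 50-bin hue histogram, normalized so images of different sizes are comparable.
    static Mat hueHistogram(Mat bgrImage) {
        Mat hsv = new Mat();
        Imgproc.cvtColor(bgrImage, hsv, Imgproc.COLOR_BGR2HSV);
        Mat hist = new Mat();
        Imgproc.calcHist(Arrays.asList(hsv), new MatOfInt(0), new Mat(),
                         hist, new MatOfInt(50), new MatOfFloat(0f, 180f));
        Core.normalize(hist, hist, 0, 1, Core.NORM_MINMAX);
        return hist;
    }
}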


Below is an example of images grouped based on a similarity index that is invariant to color correction, title treatment, cropping, and other transformations:
      [Final result with similarity index values for group of images]


Images play a crucial role in a member’s first impression of our large collection of videos. We are just scratching the surface of what we can learn from media, and we have many more ambitious and interesting problems to tackle on the road ahead.


If you are excited and passionate about solving big problems, we are hiring. Contact us.
