Introducing Raigad - An Elasticsearch Sidecar
Billing & Payments Engineering Meetup II
Now that Netflix has gained tremendous experience with AWS, the Payments Engineering team has re-engineered its suite of applications to run in the cloud. It's the first time payments have been processed from a public cloud at this scale.
Mat's team is hiring!
Senior Software Engineer in Test - Payments Platform
Billing is at a crossroads: we are still halfway in our old data center and halfway migrated to the cloud. Billing Engineering has two major aspects. One is the batch renewal of subscribing Netflix customers; the other is the set of APIs that change the billing state of a Netflix customer in some way. Our topic for discussion was how Billing Engineering manages its workflow for these APIs across different processes and teams in this scenario, and the technology stack we use to accomplish this.
Sangeeta’s team is hiring!
Senior Software Engineer in Test - Billing Platform
Netflix Product has been data driven since inception, and payment processing at Netflix is no different. With more than 55M customers paying Netflix on a monthly basis, there is a lot of data to analyze and to recommend dynamic routing of transactions to maximize approval rates. At the meetup, Shankar Vedaraman, who leads the Payment Analytics Data Science and Engineering team, presented the different payments business processes that his team focuses on and touched upon key analytical insights that his team provides.
Shankar's team is hiring!
Poorna Udupi, who leads the Product and Application Security team at Netflix, spoke about making security consumable in the form of tools, libraries and self-service applications, enabling developers to attain a rapid velocity of feature delivery while remaining secure. Speaking specifically to the audience of billing and payments enthusiasts, he discussed a few security techniques in detail: infrastructure segmentation, tokenization, and utilization of big data for fraud and abuse detection, prevention and sanitization. He provided a lightning overview of some of the open source security projects contributed by his team, such as Scumblr, Sketchy and others in the pipeline, that focus on automating away security functions so that his team can focus on security feature experimentation and innovation.
Poorna's team is hiring!
Rahul Dani, who leads the Growth Product Engineering team at Netflix, talked about the adventure of steering the middle-tier signup apps out of PCI scope and into a PCI-free environment.
Rahul's team is hiring!
Extracting contextual information from video assets
Part 1: Detecting End-Sequences
When you finish watching a movie, we are able to provide a unique post-play experience, as illustrated below in two examples: the user is presented with the next title in a series, or with content similar to the most recently watched video. The primary issue remains isolating the salient parts of series and movies without the mind-boggling challenge of manually tagging the end points of a large and ever-changing catalog. In other words, we must devise a strategy for detecting when a video ends and the end-sequence begins. Interestingly, the end-sequence is unique in a few striking ways: first, it appears at the end of the movie; second, it is almost always composed of text; finally, there is very little variation between contiguous frames. Using all three of these conditions, we created an algorithm that successfully extracts the beginning of the end-sequence (a sketch follows the example below).
Two examples of Netflix post-play experiences
Below you'll find an example of text-detected regions (highlighted with yellow rectangles) on the end-sequence of Orange is the New Black:
Automated text detection of end sequence
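To make those three conditions concrete, here is a minimal sketch of the detection logic in Java. It assumes per-frame features have already been computed upstream (a text-coverage fraction from a text detector like the one pictured above, and an inter-frame difference score); the feature names and thresholds are illustrative, not the production values.

    // Hypothetical sketch: find where the end-sequence starts from per-frame features.
    public class EndSequenceDetector {

        /**
         * textCoverage[i] - fraction of frame i covered by detected text (0..1)
         * frameDiff[i]    - normalized difference between frames i-1 and i (0..1)
         * Returns the index of the first frame of the end-sequence, or -1 if none found.
         */
        public static int findEndSequenceStart(double[] textCoverage, double[] frameDiff,
                                               double minText, double maxMotion) {
            int n = textCoverage.length;
            int start = -1;
            // Condition 1: only consider the tail of the video (last 10% of frames here).
            for (int i = (int) (n * 0.9); i < n; i++) {
                boolean textual = textCoverage[i] >= minText;  // Condition 2: mostly text
                boolean still   = frameDiff[i] <= maxMotion;   // Condition 3: little variation
                if (textual && still) {
                    if (start < 0) start = i;  // candidate start of a qualifying run
                } else {
                    start = -1;                // run broken before the end; reset
                }
            }
            return start;
        }

        public static void main(String[] args) {
            double[] text = new double[100];
            double[] diff = new double[100];
            java.util.Arrays.fill(diff, 0.8);   // normal content: high motion, no text
            for (int i = 92; i < 100; i++) {    // credits: text-heavy and nearly static
                text[i] = 0.5;
                diff[i] = 0.05;
            }
            System.out.println(findEndSequenceStart(text, diff, 0.2, 0.1)); // prints 92
        }
    }

In production the same idea would run over decoded frames with a real text detector and frame differencing, but the structure of the decision is the same.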
Part 2: Detecting Similar Frames Across Multiple Video Assets
Summary
Introducing Vector: Netflix's On-Host Performance Monitoring Tool

Vector provides a simple way for users to visualize and analyze system and application-level metrics in near real-time. It leverages the battle tested open source system monitoring framework, Performance Co-Pilot (PCP), layering on top a flexible and user-friendly UI. The UI polls metrics at up to 1 second resolution, rendering the data in completely configurable dashboards that simplify cross-metric correlation and analysis.
PCP’s stateless model makes it lightweight and robust. Its overhead on hosts is negligible, as clients are responsible for keeping track of state, sampling rate, and computation. Additionally, metrics are not aggregated across hosts or persisted outside of the user’s browser session, keeping the framework light. Vector requires only your local browser and PCP installed on the host you wish to monitor. No intermediate collector, server, or database infrastructure is required.
We are excited to release Vector to the community and look forward to feedback and collaboration!
High-Level Architecture
Getting Started
Performance Co-Pilot (PCP)
Vector
Dashboards & Widgets
CPU
- Load Average
- Runnable
- CPU Utilization
- Per-CPU Utilization
- Context Switches
Memory
- Memory Utilization
- Page Faults
Disk
- Disk IOPS
- Disk Throughput
- Disk Utilization
- Disk Latency
Network
- Network Drops
- TCP Retransmits
- TCP Connections
- Network Throughput
- Network Packets
Next Steps
- More widgets and dashboards
- User-defined dashboards
- Metric snapshots
- CPU Flame Graphs
- Disk Latency Heat Maps
- Integration with Servo
- Support for Cassandra
Conclusion
Learning a Personalized Homepage
Page-level algorithmic challenge
Building a page algorithmically
Machine Learning for page generation
Page-level metrics
Other challenges
Conclusion
Introducing FIDO: Automated Security Incident Response
We're excited to announce the open source release of FIDO (Fully Integrated Defense Operation - apologies to the FIDO Alliance for acronym collision), our system for automatically analyzing security events and responding to security incidents.
Overview
The typical process for investigating security-related alerts is labor intensive and largely manual. To make the situation more difficult, as attacks increase in number and diversity, there is an increasing array of detection systems deployed, generating even more alerts for security teams to investigate. Netflix, like all organizations, has a finite amount of resources to combat this phenomenon, so we built FIDO to help. FIDO is an orchestration layer that automates the incident response process by evaluating, assessing and responding to malware and other detected threats.
The idea for FIDO came from a simple proof of concept a number of years ago. Our process for handling alerts from one of our network-based malware systems was to have a help desk ticket created and assigned to a desktop engineer for follow-up - typically a scan of the impacted system or perhaps a re-image of the hard drive. The time from alert generation to resolution of these tickets spanned from days to over a week. Our help desk system had an API, so we had a hypothesis that we could cut down resolution time by automating the alert-to-ticket process. The simple system we built to ingest the alerts and open the tickets cut the resolution time to a few hours, and we knew we were onto something - thus FIDO was born.
Architecture and Operation
This section describes FIDO's operation, and the following diagram provides an overview of FIDO’s architecture.
Detection
FIDO’s operation begins with the receipt of an event via one of FIDO’s detectors. Detectors are off-the-shelf security products (e.g. firewalls, IDS, anti-malware systems) or custom systems that detect malicious activities or threats. Detectors generate alerts or messages that FIDO ingests for further processing. FIDO provides a number of ways to ingest events, including via API (the preferred method), SQL database, log file, and email. FIDO currently supports a variety of detectors (e.g. Cyphort, ProtectWise, CarbonBlack/Bit9), with more planned or under development.

Analysis and Enrichment
The next phase of FIDO operation involves deeper analysis of the event and enrichment of the event data with both internal and external data sources. Raw security events often have little associated context, and this phase of operation is designed to supplement the raw event data with supporting information to enable more accurate and informed decision making.

The first component of this phase is analysis of the event’s target - typically a computer and/or user (but potentially any targeted resource). Is the machine a Windows host or a Linux server? Is it in the PCI zone? Does the system have security software installed and the latest patches? Is the targeted user a Domain Administrator? An executive? Having answers to these questions allows us to better evaluate the threat and determine what actions need to be taken (and with what urgency). To gather this data, FIDO queries various internal data sources - currently supported are Active Directory, LANDesk, and JAMF, with other sources under consideration.
In addition to querying internal sources, FIDO consults external threat feeds for information relevant to the event under analysis. The use of threat feeds helps FIDO determine whether a generated event may be a false positive, or how serious and pervasive the issue may be. Another way to think of this step is ‘never trust, always verify.’ A generated alert is simply raw data - it must be enriched, evaluated, and corroborated before actioning. FIDO supports several threat feeds, including ThreatGrid and VirusTotal, with additional feeds under consideration.
Correlation and Scoring
Once internal and external data has been gathered about a given event and its target(s), FIDO seeks to correlate the information with other data it has seen and score the event to facilitate ultimate disposition. The correlation component serves several functions. First, have multiple detectors identified the same issue? If so, it could potentially be a more serious threat. Second, has one of your detectors already blocked or remediated the issue (for example, a network-based malware detector identifies an issue, and a separate host-based system repels the same item)? If the event has already been addressed by one of your controls, FIDO may simply provide a notification that requires no further action. The following image gives a sense of how the various scoring components work together.

Notification and Enforcement
In this phase, FIDO determines and executes a next action based on the ingested event, collected data, and calculated scores. This action may simply be an email to the security team with details, or storing the information for later retrieval and analysis. Or, FIDO may implement more complex and proactive measures such as disabling an account, ending a VPN session, or disabling a network port. Importantly, the vast majority of enforcement logic in FIDO has been Netflix-specific. For this reason, we’ve removed most of this logic and code from the current OSS version of FIDO. We will re-implement this functionality in the OSS version when we are better able to provide the end-user reasonable and scalable control over enforcement customization and actions.

Open Items & Future Plans
Netflix has been using FIDO for a bit over 4 years, and while it is meeting our requirements well, we have a number of features and improvements planned. On the user interface side, we are planning for an administrative UI with dashboards and assistance for enforcement configuration. Additional external integrations planned include PAN, OpenDNS, and SentinelOne. We're also working on improvements around correlation and host detection. And, because it's now OSS, you are welcome to suggest and submit your own improvements!

Netflix Streaming - More Energy Efficient than Breathing
Netflix Streaming: Energy Consumption for 2014 was 0.0013 kWh per Streaming Hour Delivered
36% was from renewable sources
28% was offset with renewable energy credits
- We plan to be fully offset by 2015, and to increase the contribution of renewable sources
A carbon footprint of about 300g of CO2 per customer represents about 0.0007% of the typical US household footprint of 43,000 kg (48 tons) of CO2 per year
- The majority of our technology is operated in the Amazon Web Services (AWS) cloud platform. AWS offers us unprecedented global scale, hosting tens of thousands of virtual instances and many petabytes of data across several cloud regions.
- The audio-video media itself is delivered from “Open Connect” content servers, which are forward positioned close to, or inside of, ISP networks for efficient delivery.
- The ISP networks, which carry the data across “the last mile” from our content servers to our customers.
- The “consumer premises equipment” (CPE) that includes cable or DSL modems, routers, WiFi access points, set-top boxes, and TVs, laptops, tablets, and phones.
AWS Footprint
ISPs
Consumer Premise Equipment
Comparisons
Localization Technologies at Netflix
- A developer makes a change to the English string data in a bundle in a namespace
- Translation workflows are automatically triggered
- Linguist completes the translation workflow
- Translations are made available to the bundle in the namespace
- Runtime: Allows fast propagation of changes to UIs
- Build time: Uses Global String Repository solely for localization but packages the data with the builds
NTS: Real-time Streaming for Test Automation
by Peter Hausel and Jwalant Shah
Netflix Test Studio
- Collect test results in near-realtime.
- A highly event driven architecture allows us to accomplish this: JSON snippets sent from the single page UI to the device and JavaScript listeners on the device firing back events. We also have a requirement to be able to play back events as they happened, just like a state machine.
- Allow testers to interact with both the device and various Netflix services during execution.
- Integrated tests require the control of the test execution stream in order to simulate real-world conditions. We want to simulate failures, pause, debug and resume during test execution.
A Typical NTS Test:
- Test Executor submits events in a time series fashion to a Websocket Bus which terminates at Dispatcher.
- Client connects to a Dispatcher with session Id information. One-to-many relationship between Dispatcher and TestExecutors.
- A Dispatcher instance keeps an internal lookup from test execution session ids to the WebSocket connections to Test Executors, and delivers messages received over those connections to the Client.
- Dispatcher is responsible for handling client requests to subscribe to Test Execution events stream.
- Kafka provides a scalable message queue between Test Executor and Dispatcher. Since each session id is mapped to a particular partition, and each message sent to the client includes the current Kafka offset, we can guarantee reliable delivery of messages to clients, with support for replaying messages after a network reconnection (see the sketch after this list).
- Multiple clients can subscribe to the same stream without additional overhead and admin users can view/monitor remote users test execution in real time.
- The same stream is consumed for analytics purposes as well.
- Throughput/Latency: during load testing, we could get ~90-100ms latency per message consistently with 100 concurrent users (our test setup was 6 brokers deployed on 6 d2.xlarge instances). In our production system, latency is often lower due to batching.
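A minimal sketch of the session-to-partition mapping described above, using the standard kafka-clients producer API; the topic name, session id, and payload are illustrative assumptions, not the actual NTS code.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class TestEventPublisher {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String sessionId = "test-session-42";  // hypothetical test execution session
                // Keying every event by session id sends all of a session's events to the
                // same partition, preserving per-session ordering.
                ProducerRecord<String, String> record =
                    new ProducerRecord<>("nts-test-events", sessionId, "{\"event\":\"stepCompleted\"}");
                RecordMetadata meta = producer.send(record).get();
                // The Dispatcher can forward this offset to clients; on reconnect, a client
                // replays from its last seen offset to recover any missed messages.
                System.out.println("partition=" + meta.partition() + " offset=" + meta.offset());
            }
        }
    }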
(Engineers who worked on this project: Jwalant Shah, Joshua Hua, Matt Sun)
Tracking down the Villains: Outlier Detection at Netflix
Shadows in the Glass
Finding a Rabbit in a Snowstorm
How DBSCAN Works
How We Use DBSCAN
- email or page a service owner
- remove the server from service without terminating it
- gather forensic data for investigation
- terminate the server to allow the auto scaling group to replace it
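For a rough illustration of this flow (not our production code), the sketch below uses the DBSCAN implementation from Apache Commons Math: each server is represented by a vector of metrics, and servers that fall into no cluster are DBSCAN "noise", i.e. the outliers that would trigger one of the actions above.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.IdentityHashMap;
    import java.util.List;
    import java.util.Set;
    import org.apache.commons.math3.ml.clustering.Cluster;
    import org.apache.commons.math3.ml.clustering.DBSCANClusterer;
    import org.apache.commons.math3.ml.clustering.DoublePoint;

    public class OutlierServers {
        /** Returns indices of servers whose metric vectors belong to no DBSCAN cluster. */
        public static List<Integer> findOutliers(double[][] serverMetrics, double eps, int minPts) {
            List<DoublePoint> points = new ArrayList<>();
            for (double[] metrics : serverMetrics) {
                points.add(new DoublePoint(metrics));
            }
            // Identity set, so two servers with identical metrics are still tracked separately.
            Set<DoublePoint> clustered = Collections.newSetFromMap(new IdentityHashMap<>());
            for (Cluster<DoublePoint> cluster :
                     new DBSCANClusterer<DoublePoint>(eps, minPts).cluster(points)) {
                clustered.addAll(cluster.getPoints());
            }
            List<Integer> outliers = new ArrayList<>();
            for (int i = 0; i < points.size(); i++) {
                if (!clustered.contains(points.get(i))) {
                    outliers.add(i);  // noise point: a candidate misbehaving server
                }
            }
            return outliers;
        }
    }

The eps and minPts arguments correspond to the parameter selection discussed below.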
Parameter Selection
Into the Ring
Server Count | Precision | Recall | F-score
1960 | 93% | 87% | 90%
The Ones We Leave Behind
World on Fire
- Analysis and tuning of service thresholds and timeouts
- Automated canary analysis
- Shifting traffic in response to region-wide outages
- Automated performance tests that tune our autoscaling rules
Java in Flames
Example
This shows CPU consumption by a Java process, both user- and kernel-level, during a vert.x benchmark:

Showing all CPU usage with Java context is amazing and useful. On the top right you can see a peak of kernel code (colored red) for performing a TCP send (which often leads to a TCP receive while handling the send). Beneath it (colored green) is the Java code responsible. In the middle (colored green) is the Java code that is running on-CPU. And in the bottom left, a small yellow tower shows CPU time spent in GC.
We've already used Java flame graphs to quantify performance improvements between frameworks (Tomcat vs rxNetty), which included identifying time spent in Java code compilation, the Java code cache, other system libraries, and differences in kernel code execution. All of these CPU consumers were invisible to other Java profilers, which only focus on the execution of Java methods.
Flame Graph Interpretation
If you are new to flame graphs: The y axis is stack depth, and the x axis spans the sample population. Each rectangle is a stack frame (a function), where the width shows how often it was present in the profile. The ordering from left to right is unimportant (the stacks are sorted alphabetically).
In the previous example, color hue was used to highlight different code types: green for Java, yellow for C++, and red for system. Color intensity was simply randomized to differentiate frames (other color schemes are possible).
You can read the flame graph from the bottom up, which follows the flow of code from parent to child functions. Another way is top down, as the top edge shows the function running on CPU, and beneath it is its ancestry. Focus on the widest functions, which were present in the profile the most. See the CPU flame graphs page for more about interpretation, and Brendan's USENIX/LISA'13 talk (video).
The Problem with Profilers
In order to generate flame graphs, you need a profiler that can sample stack traces. There have historically been two types of profilers used on Java:
- System profilers: such as Linux perf_events, which can profile system code paths, including libjvm internals, GC, and the kernel, but not Java methods.
- JVM profilers: such as hprof, Lightweight Java Profiler (LJP), and commercial profilers. These show Java methods, but not system code paths.
Ideally, we would have one flame graph that shows it all: system and Java code together.
A system profiler like Linux perf_events should be well suited to this task as it can interrupt any software asynchronously and capture both user- and kernel-level stacks. However, system profilers don't work well with Java. The problem is shown by the flame graph on the right. The Java stacks and method names are missing.
There were two specific problems to solve:
- The JVM compiles methods on the fly (just-in-time: JIT), and doesn't expose a symbol table for system profilers.
- The JVM also uses the frame pointer register on x86 (RBP on x86-64) as a general-purpose register, breaking traditional stack walking.
Fixing Symbols
In 2009, Linux perf_events added JIT symbol support, so that symbols from language virtual machines like the JVM could be inspected. To use it, your application creates a /tmp/perf-PID.map text file, which lists symbol addresses (in hex), sizes, and symbol names. perf_events looks for this file by default and, if found, uses it for symbol translations.
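Each line of the map file is simply "START SIZE symbolname", with the start address and size in hex. A hypothetical excerpt (addresses and class names are made up):

    7f6bf3a01020 1a0 Ljava/util/HashMap;::get
    7f6bf3a011c0 2e8 Lcom/example/Request;::handle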
Java can create this file using perf-map-agent, an open source JVMTI agent written by Johannes Rudolph. The first version needed to be attached on Java startup, but Johannes enhanced it to attach later on demand and take a symbol dump. That way, we only load it if we need it for a profile. Thanks, Johannes!
Since symbols can change slightly during the profile (we’re typically profiling for 30 or 60 seconds), a symbol dump may include stale symbols. We’ve looked at taking two symbol dumps, before and after the profile, to highlight any such differences. Another approach in development involves a timestamped symbol log to ensure that all translations are accurate (although this requires always-on logging of symbols). So far symbol churn hasn’t been a large problem for us: once Java and the JIT have “warmed up” (which can take a few minutes, given sufficient load), churn is minimal. We do bear it in mind when interpreting flame graphs.
Fixing Frame Pointers
For many years the gcc compiler has reused the frame pointer register as a compiler optimization, breaking stack traces. Some applications compile with the gcc option -fno-omit-frame-pointer to preserve this type of stack walking; the JVM, however, had no equivalent option. Could the JVM be modified to support this?
Brendan was curious to find out, and hacked a working prototype for OpenJDK. It involved dropping RBP from eligible register pools, eg (diff):
--- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 02:52:11.000000000 +0000
+++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 01:10:49.686044933 +0000
@@ -166,10 +166,9 @@
 // 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ )
 //
-// Class for all pointer registers (including RSP)
+// Class for all pointer registers (including RSP, excluding RBP)
 reg_class any_reg(RAX, RAX_H,
                   RDX, RDX_H,
-                  RBP, RBP_H,
                   RDI, RDI_H,
                   RSI, RSI_H,
                   RCX, RCX_H,

... and then fixing the function prologues to store the stack pointer (rsp) into the frame pointer (base pointer) register (rbp):

--- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04 02:52:11.000000000 +0000
+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 23:57:11.589593723 +0000
@@ -5236,6 +5236,7 @@
   // We always push rbp, so that on return to interpreter rbp, will be
   // restored correctly and we can correct the stack.
   push(rbp);
+  mov(rbp, rsp);
   // Remove word for ebp
   framesize -= wordSize;

It worked. Here are the before and after flame graphs. Brendan posted it, with example flame graphs, to the hotspot compiler devs mailing list. This feature request became JDK-8068945 for JDK9 and JDK-8072465 for JDK8.
Fixing this properly involved a lot more work (see discussions in the bugs and mailing list). Zoltán Majó, of Oracle, took this on and rewrote the patch. After testing, it was finally integrated into the early access releases of both JDK9 and JDK8 (JDK8 update 60 build 19), as the new JDK option: -XX:+PreserveFramePointer.
Many thanks to Zoltán, Oracle, and the other engineers who helped get this done!
Since use of this mode disables a compiler optimization, it does decrease performance slightly. We've found in tests that this costs between 0 and 3% extra CPU, depending on the workload. See JDK-8068945 for some additional benchmarking details. There are other techniques for walking stacks, some with zero run-time cost to make available, but those approaches come with other downsides.
Instructions
The following steps describe how these flame graphs can be created. We’re working on improving and automating these steps using Vector (more on that in a moment).
1. Install Components
There are four components to install:
Linux perf_events
This is the standard Linux profiler, aka “perf” after its front end, and is included in the Linux source (tools/perf). Try running perf help to see if it is installed; if not, your distro may suggest how to get it, usually by adding a linux-tools-common package.
Java 8 update 60 build 19 (or newer)
This includes the frame pointer patch fix (JDK-8072465), which is necessary for Java stack profiling. It is currently released as early access (built from OpenJDK).
perf-map-agent
This is a JVMTI agent, available on github, that provides Java symbol translation for perf_events. Steps to build it typically involve:
apt-get install cmake
export JAVA_HOME=/path-to-your-new-jdk8
git clone --depth=1 https://github.com/jrudolph/perf-map-agent
cd perf-map-agent
cmake .
make

The current version of perf-map-agent can be loaded on demand, after Java is running.
WARNING: perf-map-agent is experimental code – use at your own risk, and test before use!
FlameGraph
This is some Perl software for generating flame graphs. It can be fetched from github:
git clone --depth=1 https://github.com/brendangregg/FlameGraph

This contains stackcollapse-perf.pl, for processing perf_events profiles, and flamegraph.pl, for generating the SVG flame graph.
2. Configure Java
Java needs to be running with the -XX:+PreserveFramePointer option, so that perf_events can perform frame pointer stack walks. As mentioned earlier, this can cost some performance, between 0 and 3% depending on the workload.
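For example (the jar name here is just a placeholder):

    java -XX:+PreserveFramePointer -jar your-app.jar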
3a. Generate System Wide Flame Graphs
With this software and Java running with frame pointers, we can profile and generate flame graphs.
For example, taking a 30-second profile at 99 Hertz (samples per second) of all processes, then caching symbols for Java PID 1690, then generating a flame graph:
sudo perf record -F 99 -a -g -- sleep 30
java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar net.virtualvoid.perf.AttachOnce 1690 # run as same user as java
sudo chown root /tmp/perf-*.map
sudo perf script | stackcollapse-perf.pl | \
    flamegraph.pl --color=java --hash > flamegraph.svg

The attach-main.jar file is from perf-map-agent, and stackcollapse-perf.pl and flamegraph.pl are from FlameGraph. Specify their full paths unless they are in the current directory.
These steps address some quirky behavior involving user permissions: sudo perf script only reads symbol files owned by the current user (root), and perf-map-agent creates files with the same user ownership as the Java process, which for us is usually non-root. This means we have to change the ownership of the symbol file to root before running perf script.
With jmaps
Dealing with symbol files has become a chore, so we’ve been automating it. Here’s one example: jmaps, which can be used like so:
sudo perf record -F 99 -a -g -- sleep 30; sudo jmaps
sudo perf script | stackcollapse-perf.pl | \
    flamegraph.pl --color=java --hash > flamegraph.svg

jmaps creates symbol files for all Java processes, with root ownership. You may want to write a similar “jmaps” helper for your environment (our jmaps example is unsupported). Remember to clean up the /tmp symbol files when you no longer need them!
3b. Generate By-Process Flame Graphs
The previous procedure grouped Java processes together. If it is important to separate them (and, on some of our instances, it is), you can modify the procedure to generate a by-process flame graph. Eg (with jmaps):
sudo perf record -F 99 -a -g -- sleep 30; sudo jmaps
sudo perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace | \
    stackcollapse-perf.pl --pid | \
    flamegraph.pl --color=java --hash > flamegraph.svg

The output of stackcollapse-perf.pl formats each stack as a single line, and is great food for grep/sed/awk. For the flame graph at the top of this post, we used the above procedure and added “| grep java-339” before the “| flamegraph.pl” to isolate that one process. You could also add “| grep -v cpu_idle” to exclude the kernel idle threads.
Missing Frames
If you start using these flame graphs, you’ll notice that many Java frames (methods) are missing. Compared to the jstack(1) command line tool, the stacks seen in the flame graph look perhaps one third as deep, and are missing many frames. This is because of inlining, combined with this type of profiling (frame pointer based) which only captures the final executed code.
This hasn’t been much of a problem so far: even when many frames are missing, enough remain that we can figure out what’s going on. We’ve also experimented with reducing the amount of inlining, eg, using -XX:InlineSmallCode=500, to increase the number of frames in the profile. In some cases this even improves performance slightly, as the final compiled instruction size is reduced, fitting better into the processor caches (we confirmed this using perf_events separately).
Another approach is to use JVMTI information to unfold the inlined symbols. perf-map-agent has a mode to do this; however, Min Zhou from LinkedIn has experienced Java crashes when using this, which he has been fixing in his version. We’ve not seen these crashes (as we rarely use that mode), but be warned.
The previous steps for generating flame graphs are a little tedious. As we expect these flame graphs will become an everyday tool for Java developers, we’ve looked at making them as easy as possible: a point-and-click interface. We’ve been prototyping this with our open source instance analysis tool: Vector.
Vector was described in more detail in a previous techblog post. It provides a simple way for users to visualize and analyze system and application-level metrics in near real-time, and flame graphs are a great addition to the set of functionality it already provides.
We tried to keep the user interaction as simple as possible. To generate a flame graph, you connect Vector to the target instance, add the flame graph widget to the dashboard, then click the generate button. That's it!
Behind the scenes, Vector requests a flame graph from a custom instance agent that we developed, which also supplies Vector's other metrics. Vector checks the status of this request while fetching and displaying other metrics, and displays the flame graph when it is ready.
Our custom agent is not generic enough to be used by everyone yet (it depends on the Netflix environment), so we have yet to open-source it. If you're interested in testing or extending it, reach out to us.
We have some enhancements planned. One is for regression analysis, by automatically collecting flame graphs over different days and generating flame graph differentials for them. This will help us quickly understand changes in CPU usage due to software changes.
Apart from CPU profiling, perf_events can also trace user- and kernel-level events, including disk I/O, networking, scheduling, and memory allocation. When these are synchronously triggered by Java, a mixed-mode flame graph will show the code paths that led to these events. A page fault mixed-mode flame graph, for example, can be used to show which Java code paths led to an increase in main memory usage (RSS).
We also want to develop enhancements for flame graphs and Vector, including real time updates. For this to work, our agent will collect perf_events directly and return a data structure representing the partial flame graph to Vector with every check. Vector, with this information, will be able to assemble the flame graph in real time, while the profile is still being collected. We are also investigating using D3 for flame graphs, and adding interactivity improvements.
Other Work
Twitter has also explored making perf_events and Java work better together, which Kaushik Srenevasan summarized in his Tracing and Profiling talk from OSCON 2014 (slides). Kaushik showed that perf_events has much lower overhead than some other Java profilers, and included a mixed-mode stack trace from perf_events. David Keenan from Twitter also described this work in his Twitter-Scale Computing talk (video), as well as summarizing other performance enhancements they have been making to the JVM.
At Google, Stephane Eranian has been working on perf_events and Java as well and has posted a patch series that supports a timestamped JIT symbol transaction log from Java for accurate symbol translation, solving the stale symbol problem. It’s impressive work, although a downside with the logging technique may be the performance cost of always logging symbols even if a profiler is never used.
Conclusion
CPU mixed-mode flame graphs help identify and quantify all CPU consumers. They show the CPU time spent in Java methods, system libraries, and the kernel, all in one visualization. This reveals CPU consumers that are invisible to other profilers, and has so far helped us identify issues and explain performance changes between software versions.
These mixed-mode flame graphs have been made possible by a new option in the JVM: -XX:+PreserveFramePointer, available in early access releases. In this post we described how these work, the challenges that were addressed, and provided instructions for their generation. Similar visibility for Node.js was described in our earlier post: Node.js in Flames.
by Brendan Gregg and Martin Spier
Tuning Tomcat For A High Throughput, Fail Fast System
Problem
Netflix has a number of high throughput, low latency mid-tier services. In one of these services, we observed that a huge surge in traffic over a very short span of time left the machines CPU starved and unresponsive, leading to a bad experience for the clients of this service: they would get a mix of read and connect timeouts. Read timeouts can be particularly bad if they are set to be very high, because client machines will wait a long time to hear from the server. In an SOA, this can have a ripple effect, as the clients of these clients also start getting read timeouts and all services can slow down. Under normal circumstances, the machines had an ample amount of CPU free and the service was not CPU intensive. So why did this happen? To understand that, let's first look at the high level stack for this service. The request flow looks like this:
On simulating the traffic surge in the test environment, we found that the cause of the CPU starvation was improper apache and tomcat configuration. On a sudden increase in traffic, multiple apache workers became busy, and a very large number of tomcat threads also got busy. There was a huge jump in system CPU, as none of the threads could do any meaningful work; most of the CPU time went to context switching.
Solution
Since this was a mid tier service, there was not much use of apache. So, instead of tuning two systems (apache and tomcat), it was decided to simplify the stack and get rid of apache. To understand why too many tomcat threads got busy, let's understand the tomcat threading model.
High Level Threading Model for Tomcat Http Connector
Tomcat has an acceptor thread to accept connections. In addition, there is a pool of worker threads which do the real work. The high level flow for an incoming request is:
- TCP handshake between the OS and the client to establish a connection. Depending on the OS implementation, there can be a single queue for holding connections, or multiple queues, where one holds incomplete connections which have not yet completed the TCP handshake; once completed, connections are moved to the completed connection queue for consumption by the application. The "acceptCount" parameter in the tomcat configuration controls the size of these queues.
- Tomcat acceptor thread accepts connections from the completed connection queue.
- Checks if a worker thread is available in the free thread pool. If not, creates a worker thread if the number of active threads < maxThreads. Else wait for a worker thread to become free.
- Once a free worker thread is found, acceptor thread hands the connection to it and gets back to listening for new connections.
- Worker thread does the actual job of reading input from the connection, processing the request and sending the response to the client. If the connection was not keep alive then it closes the connection and places itself in the free thread pool. For a keep alive connection, waits for more data to be available on the connection. In case data does not become available until keepAliveTimeout, closes the connection and makes itself available in the free thread pool.
If the number of tomcat threads and the acceptCount value are set too high, a sudden increase in traffic will fill up the OS queues and make all the worker threads busy. When more requests than the system can handle are sent to the machines, this "queuing" of requests is inevitable and will lead to increased busy threads, eventually causing CPU starvation. Hence, the crux of the solution is to avoid too much queuing of requests at multiple points (the OS and tomcat threads) and to fail fast (return HTTP status 503) as soon as the application's maximum capacity is reached. Here is a recommendation for doing this in practice:
Fail fast in case the system capacity for a machine is hit
Estimate the number of threads expected to be busy at peak load. If the server responds in 5 ms on average, a single thread can serve at most 200 requests per second (rps). With a quad core CPU, the machine can do at most 800 rps. Now assume that 4 requests (one per core) arrive in parallel; these make 4 worker threads busy for the next 5 ms. At 800 rps, 4 more requests arrive in the next 5 ms and occupy another 4 threads, while subsequent requests reuse threads that have since become free. So on average no more than 8 threads should be busy at 800 rps. The behavior will be a little different in practice because all system resources, like the CPU, are shared. Hence one should experiment to find the total throughput the system can sustain and calculate the expected number of busy threads accordingly. This provides a baseline for the number of threads needed to sustain peak load. In order to provide some buffer, let's more than triple the maximum number of threads needed, to 30. This buffer is arbitrary and can be tuned further if needed. In our experiments we used slightly more than a 3x buffer, and it worked well.
Track the number of active concurrent requests in memory and use it for fast failing. If the number of concurrent requests nears the estimated number of active threads (8 in our example), return an HTTP status code of 503. This prevents too many worker threads from becoming busy, because once the peak throughput is hit, any extra threads that become active do the very lightweight job of returning 503 and are then available for further processing.
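A minimal sketch of such a fail-fast guard as a servlet filter follows (standard javax.servlet API; the limit of 8 comes from the example estimate above, and the class itself is illustrative rather than our production code):

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicInteger;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletResponse;

    public class FailFastFilter implements Filter {
        private static final int MAX_CONCURRENT_REQUESTS = 8; // estimated busy threads at peak
        private final AtomicInteger active = new AtomicInteger();

        @Override public void init(FilterConfig config) {}
        @Override public void destroy() {}

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            if (active.incrementAndGet() > MAX_CONCURRENT_REQUESTS) {
                // Over estimated capacity: fail fast with 503 instead of queuing more work.
                active.decrementAndGet();
                ((HttpServletResponse) res).sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
                return;
            }
            try {
                chain.doFilter(req, res);
            } finally {
                active.decrementAndGet();
            }
        }
    }

Because the rejection path does almost no work, a thread that serves a 503 returns to the pool almost immediately, which is what keeps the system from becoming CPU starved under overload.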
Configure Operating System parameters
The acceptCount parameter for tomcat dictates the length of the queues at the OS level for completing TCP handshake operations (the details are OS specific). It's important to tune this parameter; otherwise one can have issues establishing connections to the machine, or excessive queuing of connections in the OS queues can lead to read timeouts. The implementation details of handling incomplete and complete connections vary across OSes: there can be a single queue of connections, or separate queues for incomplete and complete connections (please refer to the References section for details).
Too large a value for acceptCount means that incoming requests can be accepted at the OS level even when the incoming rps is more than the machine can handle. All the worker threads will eventually become busy, and the acceptor thread will wait for one to become free. More requests will continue to pile up in the OS queues, since the acceptor thread consumes them only when a worker thread becomes available. In the worst case, these requests will time out while waiting in the OS queues but will still be processed by the server once they are picked up by tomcat's acceptor thread. This is a complete waste of processing resources, as the client will never receive the response.
If the value of acceptCount is too small, then under a high rps there will not be enough space for the OS to accept connections and make them available to the acceptor thread. In this case, connect timeout errors will be returned to the client well before the actual throughput limit of the server is reached.
Hence, experiment by starting with a small value like 10 for acceptCount and keep increasing it until there are no connection errors from the server.
On doing both the changes above, even if all the worker threads become busy in the worst case, the servers will not be cpu starved and will be able to do as much work as possible (max throughput).
Other considerations
As explained above, each incoming connection is ultimately handed to a tomcat worker thread. If HTTP keep-alive is turned on, a worker thread will continue to listen on its connection and will not return to the free thread pool. So, if the clients are not smart enough to close connections that are no longer actively used, the server can very easily run out of worker threads. If keep-alive is turned on, one has to size the server farm with this constraint in mind.
Alternatively, if keep alive is turned off then one does not have to worry about the problem of inactive connections using worker threads. However, in this case on each call one has to pay the price of opening and closing the connection. Further, this will also create a lot of sockets in the TIME_WAIT state which can put pressure on the servers.
It's best to make this choice based on the use cases for the application, and to test the performance by running experiments.
Results
Multiple experiments were run with different configurations. The results are shown below. The dark blue line is the original configuration with apache and tomcat. All the others are different configurations of the tomcat-only stack.
Throughput. Note the drop after a sustained period of traffic higher than what the server can handle.
Busy Apache Workers
Idle CPU. Note that the original configuration got so busy that it was not even able to publish its idle-CPU stats on a continuous basis; the stats (valued 0) were published only intermittently for the base configuration, as highlighted in the red circles.
Server average latency to process a request
Note
It's possible to achieve the same results by tuning the combination of apache and tomcat to work together. However, since apache was of little use for our service, we found the above model simpler, with one less moving part. It's best to make choices through a combination of understanding the system and experimentation and testing in a real-world environment to verify hypotheses.
References
- https://books.google.com/books/about/UNIX_Network_Programming.html?id=ptSC4LpwGA0C&source=kp_cover&hl=en
- http://www.sean.de/Solaris/soltune.html
- https://tomcat.apache.org/tomcat-7.0-doc/config/http.html
- http://grepcode.com/project/repository.springsource.com/org.apache.coyote/com.springsource.org.apache.coyote/
Acknowledgment
I would like to thank Mohan Doraiswamy for his suggestions in this effort.
Netflix at Velocity 2015: Linux Performance Tools
In this tutorial I summarize traditional and advanced performance tools, including: top, ps, vmstat, iostat, mpstat, free, strace, tcpdump, netstat, nicstat, pidstat, swapon, lsof, sar, ss, iptraf, iotop, slabtop, pcstat, tiptop, rdmsr, lmbench, fio, pchar, perf_events, ftrace, SystemTap, ktap, sysdig, and eBPF; and reference many more. I also include updated tools diagrams for observability, sar, benchmarking, and tuning (including the image above).
This tutorial can be shared with a wide audience – anyone working on Linux systems – as a free crash course on Linux performance tools. I hope people enjoy it and find it useful. Here's the playlist.
Part 1 (youtube) (54 mins):
Part 2 (youtube) (45 mins):
Slides (slideshare):
At Netflix, we have Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis. Much of the time we don't need to login to instances directly, but when we do, this tutorial covers the tools we use.
Thanks to O'Reilly for hosting a great conference, and those who attended.
If you are passionate about the content in this tutorial, we're hiring, particularly for senior SREs and performance engineers: see Netflix jobs!
Making Netflix.com Faster
Simply put, performance matters. We know members want to immediately start browsing or watching their favorite content and have found that faster startup leads to more satisfying usage. So, when building the long-awaited update to netflix.com, the Website UI Engineering team made startup performance a first tier priority.
This effort netted a 70% reduction in startup time, and was focused on three key areas:
- Server and Client Rendering
- Universal JavaScript
- JavaScript Payload Reductions
Server and Client Rendering
The netflix.com legacy website stack had a hard separation between server markup and client enhancement. This was primarily due to the different programming languages used in each part of our application. On the server, there was Java with Tomcat, Struts and Tiles. On the browser client, we enhanced server-generated markup with JavaScript, primarily via jQuery.
This separation led to undesirable results in our startup time. Every time a visitor came to any page on netflix.com our Java tier would generate the majority of the response needed for the entire page's lifetime and deliver it as HTML markup. Often, users would be waiting for the generation of markup for large parts of the page they would never visit.
Our new architecture renders only a small amount of the page's markup, bootstrapping the client view. We can easily change the amount of the total view the server generates, making it easy to see the positive or negative impact. The server requires less data to deliver a response and spends less time converting data into DOM elements. Once the client JavaScript has taken over, it can retrieve all additional data for the remainder of the current and future views of a session on demand. The large wins here were the reduction of processing time in the server, and the consolidation of the rendering into one language.
We find the flexibility afforded by server and client rendering allows us to make intelligent choices of what to request and render in the server and the client, leading to a faster startup and a smoother transition between views.
Universal JavaScript
In order to support identical rendering on the client and server, we needed to rethink our rendering pipeline. Our previous architecture's separation between the generation of markup on the server and the enhancement of it on the client had to be dropped.
Three large pain points shaped our new Node.js architecture:
- Context switching between languages was not ideal.
- Enhancement of markup required too much direct coupling between server-only code generating markup and the client-only code enhancing it.
- We’d rather generate all our markup using the same API.
There are many solutions to this problem that don't require Universal JavaScript, but we found this lesson was most appropriate: When there are two copies of the same thing, it's fairly easy for one to be slightly different than the other. Using Universal JavaScript means the rendering logic is simply passed down to the client.
Node.js and React.js are natural fits for this style of application. With Node.js and React.js, we can render from the server and subsequently render changes entirely on the client after the initial markup and React.js components have been transmitted to the browser. This flexibility allows for the application to render the exact same output independent of the location of the rendering. The hard separation is no longer present and it's far less likely for the server and client to be different than one another.
Without shared rendering logic we couldn't have realized the potential of rendering only what was necessary on startup and everything else as data became available.
Reduce JavaScript Payload Impact
Building rich interactive experiences on the web often translates into a large JavaScript payload for users. In our new architecture, we placed significant emphasis on pruning large dependencies that we could knowingly replace with smaller modules, and on delivering only the JavaScript applicable to the current visitor.
Many of the large dependencies we relied on in the legacy architecture didn't apply in the new one. We've replaced these dependencies in favor of newer, more efficient libraries. Replacing these libraries resulted in a much smaller JavaScript payload, meaning members need less JavaScript to start browsing. We know there is significant work remaining here, and we're actively working to trim our JavaScript payload down further.
Time To Interactive
In order to test and understand the impact of our choices, we monitor a metric we call time to interactive (tti).
Amount of time spent between first known startup of the application platform and when the UI is interactive regardless of view. Note that this does not require that the UI is done loading, but is the first point at which the customer can interact with the UI using an input device.
For applications running inside a web browser, this data is easily retrievable from the Navigation Timing API (where supported).
Work is Ongoing
We firmly believe high performance is not an optional engineering goal – it's a requirement for creating great user-experiences. We have made significant strides in startup performance, and are committed to challenging our industry’s best-practices in the pursuit of a better experience for our members.
Over the coming months we'll be investigating Service Workers, ASM.js, Web Assembly, and other emerging web standards to see if we can leverage them for a more performant website experience. If you’re interested in helping create and shape the next generation of performant web user-experiences apply here.
RAD - Outlier Detection on Big Data
As we built RAD we identified four generic challenges that are ubiquitous in outlier detection on “big data.”
- High cardinality dimensions: High cardinality data sets - especially those with large combinatorial permutations of column groupings - make human inspection impractical.
- Minimizing False Positives: A successful anomaly detection tool must minimize false positives. In our experience there are many alerting platforms that “sound an alarm” that goes ultimately unresolved. The goal is to create alerting mechanisms that can be tuned to appropriately balance noise and information.
- Seasonality: Hourly/Weekly/Bi-weekly/Monthly seasonal effects are common and can be mis-identified as outliers deserving attention if not handled properly. Seasonal variability needs to be ignored.
- Data is not always normally distributed: This has been a particular challenge since Netflix has been growing over the last 24 months. Generally though, an outlier tool must be robust so that it works on data that is not normally distributed.
Algorithm
The algorithm we finally settled on uses Robust Principal Component Analysis (RPCA) to detect anomalies. PCA uses the Singular Value Decomposition (SVD) to find low rank representations of the data. The robust version of PCA (RPCA) identifies a low rank representation, random noise, and a set of outliers by repeatedly calculating the SVD and applying “thresholds” to the singular values and error for each iteration. For more information please refer to the original paper by Candes et al. (2009).
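Concretely, in the Principal Component Pursuit formulation of Candes et al., the observed matrix M is decomposed into a low rank part L and a sparse part S by solving:

    minimize ||L||_* + λ||S||_1    subject to    L + S = M

where ||L||_* is the nuclear norm of L (the sum of its singular values, which encourages low rank), ||S||_1 is the entrywise L1 norm (which encourages sparsity, so S captures the outliers), and λ is commonly set to 1/√max(m, n) for an m × n matrix. This is the standard objective from the literature; RAD's implementation may differ in details such as the explicit noise term mentioned above.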
Below is an interactive visualization of the algorithm at work on a simple/random dataset and on public climate data.
Pig Wrapper
Business Application
Netflix processes millions of transactions every day across tens of thousands of banking institutions and infrastructures, in both real-time and batch environments. We've used the above solution to detect anomalies in payment network failures at the bank level. With this system, business managers were able to follow up with their counterparts in the payment industry, thereby reducing the impact on Netflix customers.
Our signup flow was another important point of application. Today Netflix customers sign up across the world on hundreds of different types of browsers or devices. Identifying anomalies across unique combinations of country, browser/device and language helps our engineers understand and react to customer sign up problems in a timely manner.
Conclusion
Introducing Surus and ScorePMML
ScorePMML
- Someone proposes an idea and builds a model on “small” data
- We decide to “scale-up” the prototype to see how well the model generalizes to a larger dataset
- We may eventually put the model into “production”
An Example
# Required Dependencies
require(randomForest)
require(gbm)
require(pmml)
require(XML)

data(iris)

# Column Names must NOT contain periods
names(iris) <- gsub("\\.", "_", tolower(names(iris)))

# Build Models
iris.rf <- randomForest(Species ~ ., data=iris, ntree=5)
iris.gbm <- gbm(Species ~ ., data=iris, n.tree=5)

# Convert to PMML and write to file
saveXML(pmml(iris.rf), file="~/iris.rf.xml")
saveXML(pmml(iris.gbm, n.trees=5), file="~/iris.gbm.xml")
REGISTER '~/scoring.jar';
DEFINE pmmlRF com.netflix.pmml.ScorePMML('~/iris.rf.xml');
DEFINE pmmlGBM com.netflix.pmml.ScorePMML('~/iris.gbm.xml');

-- Load data
iris = load '~/iris.csv' using PigStorage(',')
       as (sepal_length, sepal_width, petal_length, petal_width, species);

-- Score two models in one pass over the data
scored = foreach iris generate pmmlRF(*) as RF, pmmlGBM(*) as GBM;
dump scored;
- We throw a Pig FrontendException when the Pig/Hive data types and column names don’t match the data types and column names in PMML. This means that you don’t need to wait for the Hadoop MR job to start before getting the feedback that something is wrong.
- The ScorePMML constructor accepts local or remote file locations. This means that you can reference an HDFS or S3 path, or you can reference a local path (see the example above).
- We’ve made scoring multiple models in parallel trivial. Furthermore, models are only read into memory once, so there isn’t a penalty when processing multiple models at the same time.
- When scoring big (and usually uncontrolled) datasets it’s important to handle errors gracefully. You don’t want to rescore 100 records because you fail on the 101st record. Rather than throwing an exception (and failing the job) we’ve added an indicator to the output tuple that can be used for alerting.
- Although this is currently written to be run in Pig we may migrate in the future to different platforms.
Conclusion
Known Issues/Limitations
- ScorePMML is built on jPMML 1.0.19, which doesn’t fully support the 4.2 PMML specification (as defined by the Data Mining Group). At the time of this writing not all enumerated missing value strategies are supported. This caused problems when we wanted to implement GBMs in PMML, so we had to add extra nodes in each tree to properly handle missing values.
- Hive 0.12.0 (and thus Pig) has strict naming conventions for columns/relations which are relaxed in PMML. Non alpha-numeric characters in column names are not supported in ScorePMML. Please see the Hive documentation for more details on column naming in the Hive metastore.
Additional Resources
- The Data Mining Group PMML Spec: The 4.1.2 specification is currently supported. The 4.2 version of the PMML spec is not currently supported. The DMG page will give you a sense of which model types are supported and how they are described in PMML.
- RPMML: An R-package for creating PMML files from common predictive modeling objects.
Netflix's Viewing Data: How We Know Where You Are in House of Cards
Use Cases
What titles have I watched?
Where did I leave off in a given title?
What else is being watched on my account right now?
Current Architecture
Current Architecture Diagram
Breaking Points
Next Generation Architecture
- Availability over consistency - our primary use cases can tolerate eventually consistent data, so design from the start favoring availability rather than strong consistency in the face of failures.
- Microservices - Components that were combined together in the stateful architecture should be separated out into services (components as services).
- Components are defined according to their primary purpose - either collection, processing, or data providing.
- Delegate responsibility for state management to the persistence tiers, keeping the application tiers stateless.
- Decouple communication between components by using signals sent through an event queue.
- Polyglot persistence - Use multiple persistence technologies to leverage the strengths of each solution.
- Achieve flexibility + performance at the cost of increased complexity.
- Use Cassandra for very high volume, low latency writes. A tailored data model and tuned configuration enables low latency for medium volume reads.
- Use Redis for very high volume, low latency reads. Redis’ first-class data types should handle writes better than the read-modify-write cycles we performed in memcached (see the sketch after this list).
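To make that last point concrete, here is a minimal sketch using the Jedis client, with a hypothetical key scheme and payload (none of this is taken from our production code), showing how a first-class list type turns an append into a single server-side operation:

import java.util.List;
import redis.clients.jedis.Jedis;

public class ViewingEventsSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Hypothetical key for one member's recent viewing events.
            String key = "viewing:events:member123";

            // Appending an event is one atomic server-side operation --
            // no read-modify-write cycle, no race with concurrent writers.
            jedis.rpush(key, "{\"title\":\"House of Cards\",\"bookmark\":2184}");

            // Low latency read of the ten most recent events.
            List<String> recent = jedis.lrange(key, -10, -1);
            recent.forEach(System.out::println);

            // The memcached equivalent would be: GET the whole serialized
            // blob, deserialize, append, reserialize, SET it back.
        }
    }
}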
Netflix Likes React
Startup Speed
Runtime Performance
Modularity
Advantages of React
Isomorphic JavaScript
Virtual DOM
React Components and Mixins
By Jordanna Kwok
SPS : the Pulse of Netflix Streaming
Creating the Right Signal
Deviation Detection Models
Static Thresholds
Exponential Smoothing
Double Exponential Smoothing
Advanced Techniques
Conclusion
References
Nicobar: Dynamic Scripting Library for Java
The Netflix API is the front door to the streaming service, handling billions of requests per day from more than 1000 different device types around the world. To provide the best experience to our subscribers, it is critical that our UI teams have the ability to innovate at a rapid pace. As described in our blog post a year ago, we developed a Dynamic Scripting Platform that enables this rapid innovation.
Today, we are happy to announce Nicobar, the open source script execution library that allows our UI teams to inject UI-specific adapter code dynamically into our JVM without the API team’s involvement. Named after a remote archipelago in the eastern Indian Ocean, Nicobar allows each UI team to have its own island of code to optimize the client/server interaction for each device, evolved at its own pace.
Background
As of this post’s writing, a single Netflix API instance hosts hundreds of UI scripts, developed by a dozen teams. Together, they deploy anywhere from a handful to a hundred UI scripts per day. A strong, core scripting library is what allows the API JVM to handle this rate of deployment reliably and efficiently.

Our success with the scripting approach in the API platform led us to identify other applications that could also benefit from the ability to alter their behavior without a full scale deployment. Nicobar is a library that provides this functionality in a compact and reusable manner, with pluggable support for JVM languages.
Architecture Overview
Early implementations of dynamic scripting at Netflix used basic Java classloader technology to host and sandbox scripts from one another. While this was a good start, it was not nearly enough. Standard Java classloaders can have only one parent, and thus allow only simple, flattened hierarchies. If one wants to share classloaders, this is a big limitation and an inefficient use of memory. Also, code loaded within standard classloaders is fully visible to downstream classloaders. Finer-grained visibility controls are helpful in restricting what packages are exported and imported into classloaders.

Given these experiences, we designed into Nicobar a script module loader that holds a graph of inter-dependent script modules. Under the hood, we use JBoss Modules (which is open source) to create Java modules. JBoss modules represent powerful extensions to basic Java classloaders, allowing for arbitrarily complex classloader dependency graphs, including multiple parents. They also support sophisticated package filters that can be applied to incoming and outgoing dependency edges.
A script module provides an interface to retrieve the list of java classes held inside it. These classes can be instantiated and methods exercised on the instances, thereby “executing” the script module.
Script source and resource bundles are represented by script archives. Metadata for the archives is defined in the form of a script module specification, where script authors can describe the content language, inter-module dependencies, import and export filters for packages, as well as user specific metadata.
Script archive contents can be in source form and/or in precompiled form (.class files). At runtime, script archives are converted into script modules by running the archive through compilers and loaders that translate any source found into classes, and then loading up all classes into a module. Script compilers and loaders are pluggable, and out of the box, Nicobar comes with compilation support for Groovy 2, as well as a simple loader for compiled java classes.
Archives can be stored into and queried from archive repositories on demand, or via a continuous repository poller. Out of the box, Nicobar comes with a choice of file-system based or Cassandra based archive repositories.
As the usage of a scripting system grows in scale, there is often the need for an administrative interface that supports publishing and modifying script archives, as well as viewing published archives. Towards this end, Nicobar comes with a manager and explorer subproject, based on Karyon and Pytheas.
Putting it all together
The diagram below illustrates how all the pieces work together.
Usage Example - Hello Nicobar!
Here is an example of initializing the Nicobar script module loader to support Groovy scripts.
Create a simple Groovy script archive, with the following Groovy file:
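A minimal HelloWorld.groovy might look like the following (the class name and contents are illustrative, not from the original post; the syntax below is plain Java-compatible Groovy):

import java.util.concurrent.Callable;

// A trivial script class; the host application will look it up by the
// Callable interface and invoke it.
public class HelloWorld implements Callable<String> {
    @Override
    public String call() throws Exception {
        return "Hello, Nicobar!";
    }
}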
Add a module specification file moduleSpec.json, along with the source:
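The exact specification schema is defined by Nicobar’s module spec serializer; the field names below are an assumption sketched to show the intent (name the module and bind it to the Groovy 2 compiler plugin), not a verbatim copy of the spec:

{
  "moduleId": "helloworld",
  "compilerPluginIds": ["groovy2"],
  "moduleDependencies": [],
  "metadata": {}
}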
Jar the source and module specification together as a jar file. This is your script archive.
Create a script module loader
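A sketch of this step, assuming the Groovy 2 compiler plugin that ships with Nicobar (the plugin ID and class name here should be verified against the Nicobar version in use):

// Build a loader that knows how to compile Groovy 2 script archives.
// ScriptCompilerPluginSpec and ScriptModuleLoader come from nicobar-core.
ScriptCompilerPluginSpec groovyPlugin = new ScriptCompilerPluginSpec.Builder("groovy2")
    .withPluginClassName("com.netflix.nicobar.groovy2.plugin.Groovy2CompilerPlugin")
    .build();

ScriptModuleLoader moduleLoader = new ScriptModuleLoader.Builder()
    .addPluginSpec(groovyPlugin)
    .build();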
Create an archive repository
If you have more than a handful of scripts, you will likely need a repository representing the collection. Let’s create a JarArchiveRepository, which is a repository of script archive jars at some file system path. Copy helloworld.jar into /tmp/archiveRepo to match the code below.
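A sketch of the repository setup, continuing the example above (class names are taken from the Nicobar project and assumed rather than guaranteed):

import java.nio.file.Path;
import java.nio.file.Paths;

// A repository of script archive jars rooted at a file system path.
Path repoRoot = Paths.get("/tmp/archiveRepo");
ArchiveRepository repository = new JarArchiveRepository.Builder(repoRoot).build();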
Hooking up the repository poller provides dynamic updates of discovered modules into the script module loader. You can wire up multiple repositories to a poller, which would poll them iteratively.
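A sketch of wiring the poller to the loader and repository (the addRepository signature here is an assumption; check the project’s README for the exact form):

import java.util.concurrent.TimeUnit;

// Poll the repository every 30 seconds and push any discovered or
// changed archives into the module loader.
ArchiveRepositoryPoller poller = new ArchiveRepositoryPoller.Builder(moduleLoader).build();
poller.addRepository(repository, 30, TimeUnit.SECONDS, true);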
Execute script
Script modules can be retrieved out of the module loader by name (and an optional version). Classes can be retrieved from script modules by name, or by type. Nicobar itself is agnostic to the type of the classes held in the module, and leaves it to the application’s business logic to decide what to extract out and how to execute classes.
Here is an example of extracting a class implementing Callable and executing it:
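A sketch of that flow, continuing the example above (ScriptModuleUtils.findAssignableClass is the lookup helper as we recall it from the project; verify against your Nicobar version):

import java.util.concurrent.Callable;

// Look up the module by name, find a class implementing Callable,
// instantiate it, and invoke it.
ScriptModule module = moduleLoader.getScriptModule("helloworld");
Class<?> callableClass = ScriptModuleUtils.findAssignableClass(module, Callable.class);

@SuppressWarnings("unchecked")
Callable<String> hello = (Callable<String>) callableClass.newInstance();
System.out.println(hello.call()); // prints the greeting from the script archive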
At this point, any changes to the script archive jar will result in an update of the script module inside the module loader and new classes reflecting the update will be vended seamlessly!
More about the Module Loader
In addition to the ability to dynamically inject code, Nicobar’s module loading system also allows for multiple variants of a script module to coexist, providing for runtime selection of a variant. As an example, tracing code execution involves adding instrumentation code, which adds overhead. Using Nicobar, the application could vend classes from an instrumented version of the module when tracing is needed, while vending classes from the uninstrumented, faster version of the module otherwise. This paves the way for on demand tracing of code without having to add constant overhead on all executions.

Module variants can also be leveraged to perform slow rollouts of script modules. When a module deployment is desired, a portion of the control flow can be directed through the new version of the module at runtime. Once confidence is gained in the new version, the update can be “completed” by flushing out the old version and sending all control flow through the new module.
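Nicobar leaves the variant-selection policy to the application; one purely illustrative way to express a slow rollout (the module naming convention below is hypothetical, not a Nicobar API):

import java.util.Random;

// Route a configurable fraction of control flow through the new
// module version; the rest continues through the stable version.
double rolloutFraction = 0.05;      // e.g., start with 5% of executions
Random random = new Random();
String moduleId = random.nextDouble() < rolloutFraction
        ? "myModule-v2"             // candidate version under rollout
        : "myModule-v1";            // current stable version
ScriptModule module = moduleLoader.getScriptModule(moduleId);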
Static parts of an application may benefit from a modular classloading architecture as well. Large applications loaded into a monolithic classloader can become unwieldy over time, due to an accumulation of unintended dependencies and tight coupling between various parts of the application. In contrast, loading components using Nicobar modules allows for well defined boundaries and fine-grained isolation between them. This, in turn, facilitates decoupling of components, thereby allowing them to evolve independently.
Conclusion
We are excited by the possibilities around creating dynamic applications using Nicobar. As usage of the library grows, we expect to see various feature requests around access controls, additional persistence and query layers, and support for other JVM languages.

Project Jigsaw, the JDK’s native module loading system, is on the horizon too, and we are interested in seeing how Nicobar can leverage native module support from Jigsaw.
If these kinds of opportunities and challenges interest you, we are hiring and would love to hear from you!