Foundation Model for Personalized Recommendation

March 28, 2025, 5:51 pm

≫ Next: Globalizing Productions with Netflix’s Media Production Suite

By Ko-Jen Hsiao, Yesu Feng and Sudarshan Lamkhede

Motivation

Netflix’s personalized recommender system is a complex system, boasting a variety of specialized machine learned models each catering to distinct needs including “Continue Watching” and “Today’s Top Picks for You.” (Refer to our recent overview for more details). However, as we expanded our set of personalization algorithms to meet increasing business needs, maintenance of the recommender system became quite costly. Furthermore, it was difficult to transfer innovations from one model to another, given that most are independently trained despite using common data sources. This scenario underscored the need for a new recommender system architecture where member preference learning is centralized, enhancing accessibility and utility across different models.

Particularly, these models predominantly extract features from members’ recent interaction histories on the platform. Yet, many are confined to a brief temporal window due to constraints in serving latency or training costs. This limitation has inspired us to develop a foundation model for recommendation. This model aims to assimilate information both from members’ comprehensive interaction histories and our content at a very large scale. It facilitates the distribution of these learnings to other models, either through shared model weights for fine tuning or directly through embeddings.

The impetus for constructing a foundational recommendation model is based on the paradigm shift in natural language processing (NLP) to large language models (LLMs). In NLP, the trend is moving away from numerous small, specialized models towards a single, large language model that can perform a variety of tasks either directly or with minimal fine-tuning. Key insights from this shift include:

A Data-Centric Approach: Shifting focus from model-centric strategies, which heavily rely on feature engineering, to a data-centric one. This approach prioritizes the accumulation of large-scale, high-quality data and, where feasible, aims for end-to-end learning.
Leveraging Semi-Supervised Learning: The next-token prediction objective in LLMs has proven remarkably effective. It enables large-scale semi-supervised learning using unlabeled data while also equipping the model with a surprisingly deep understanding of world knowledge.

These insights have shaped the design of our foundation model, enabling a transition from maintaining numerous small, specialized models to building a scalable, efficient system. By scaling up semi-supervised training data and model parameters, we aim to develop a model that not only meets current needs but also adapts dynamically to evolving demands, ensuring sustainable innovation and resource efficiency.

Data

At Netflix, user engagement spans a wide spectrum, from casual browsing to committed movie watching. With over 300 million users at the end of 2024, this translates into hundreds of billions of interactions — an immense dataset comparable in scale to the token volume of large language models (LLMs). However, as in LLMs, the quality of data often outweighs its sheer volume. To harness this data effectively, we employ a process of interaction tokenization, ensuring meaningful events are identified and redundancies are minimized.

Tokenizing User Interactions: Not all raw user actions contribute equally to understanding preferences. Tokenization helps define what constitutes a meaningful “token” in a sequence. Drawing an analogy to Byte Pair Encoding (BPE) in NLP, we can think of tokenization as merging adjacent actions to form new, higher-level tokens. However, unlike language tokenization, creating these new tokens requires careful consideration of what information to retain. For instance, the total watch duration might need to be summed or engagement types aggregated to preserve critical details.

**Figure 1.**Tokenization of user interaction history by merging actions on the same title, preserving important information.

This tradeoff between granular data and sequence compression is akin to the balance in LLMs between vocabulary size and context window. In our case, the goal is to balance the length of interaction history against the level of detail retained in individual tokens. Overly lossy tokenization risks losing valuable signals, while too granular a sequence can exceed practical limits on processing time and memory.

Even with such strategies, interaction histories from active users can span thousands of events, exceeding the capacity of transformer models with standard self attention layers. In recommendation systems, context windows during inference are often limited to hundreds of events — not due to model capability but because these services typically require millisecond-level latency. This constraint is more stringent than what is typical in LLM applications, where longer inference times (seconds) are more tolerable.

To address this during training, we implement two key solutions:

Sparse Attention Mechanisms: By leveraging sparse attention techniques such as low-rank compression, the model can extend its context window to several hundred events while maintaining computational efficiency. This enables it to process more extensive interaction histories and derive richer insights into long-term preferences.
Sliding Window Sampling: During training, we sample overlapping windows of interactions from the full sequence. This ensures the model is exposed to different segments of the user’s history over multiple epochs, allowing it to learn from the entire sequence without requiring an impractically large context window.

At inference time, when multi-step decoding is needed, we can deploy KV caching to efficiently reuse past computations and maintain low latency.

These approaches collectively allow us to balance the need for detailed, long-term interaction modeling with the practical constraints of model training and inference, enhancing both the precision and scalability of our recommendation system.

Information in Each ‘Token’: While the first part of our tokenization process focuses on structuring sequences of interactions, the next critical step is defining the rich information contained within each token. Unlike LLMs, which typically rely on a single embedding space to represent input tokens, our interaction events are packed with heterogeneous details. These include attributes of the action itself (such as locale, time, duration, and device type) as well as information about the content (such as item ID and metadata like genre and release country). Most of these features, especially categorical ones, are directly embedded within the model, embracing an end-to-end learning approach. However, certain features require special attention. For example, timestamps need additional processing to capture both absolute and relative notions of time, with absolute time being particularly important for understanding time-sensitive behaviors.

To enhance prediction accuracy in sequential recommendation systems, we organize token features into two categories:

Request-Time Features: These are features available at the moment of prediction, such as log-in time, device, or location.
Post-Action Features: These are details available after an interaction has occurred, such as the specific show interacted with or the duration of the interaction.

To predict the next interaction, we combine request-time features from the current step with post-action features from the previous step. This blending of contextual and historical information ensures each token in the sequence carries a comprehensive representation, capturing both the immediate context and user behavior patterns over time.

Considerations for Model Objective and Architecture

As previously mentioned, our default approach employs the autoregressive next-token prediction objective, similar to GPT. This strategy effectively leverages the vast scale of unlabeled user interaction data. The adoption of this objective in recommendation systems has shown multiple successes [1–3]. However, given the distinct differences between language tasks and recommendation tasks, we have made several critical modifications to the objective.

Firstly, during the pretraining phase of typical LLMs, such as GPT, every target token is generally treated with equal weight. In contrast, in our model, not all user interactions are of equal importance. For instance, a 5-minute trailer play should not carry the same weight as a 2-hour full movie watch. A greater challenge arises when trying to align long-term user satisfaction with specific interactions and recommendations. To address this, we can adopt a multi-token prediction objective during training, where the model predicts the next n tokens at each step instead of a single token[4]. This approach encourages the model to capture longer-term dependencies and avoid myopic predictions focused solely on immediate next events.

Secondly, we can use multiple fields in our input data as auxiliary prediction objectives in addition to predicting the next item ID, which remains the primary target. For example, we can derive genres from the items in the original sequence and use this genre sequence as an auxiliary target. This approach serves several purposes: it acts as a regularizer to reduce overfitting on noisy item ID predictions, provides additional insights into user intentions or long-term genre preferences, and, when structured hierarchically, can improve the accuracy of predicting the target item ID. By first predicting auxiliary targets, such as genre or original language, the model effectively narrows down the candidate list, simplifying subsequent item ID prediction.

Unique Challenges for Recommendation FM

In addition to the infrastructure challenges posed by training bigger models with substantial amounts of user interaction data that are common when trying to build foundation models, there are several unique hurdles specific to recommendations to make them viable. One of unique challenges is entity cold-starting.

At Netflix, our mission is to entertain the world. New titles are added to the catalog frequently. Therefore the recommendation foundation models require a cold start capability, which means the models need to estimate members’ preferences for newly launched titles before anyone has engaged with them. To enable this, our foundation model training framework is built with the following two capabilities: Incremental training and being able to do inference with unseen entities.

Incremental training : Foundation models are trained on extensive datasets, including every member’s history of plays and actions, making frequent retraining impractical. However, our catalog and member preferences continually evolve. Unlike large language models, which can be incrementally trained with stable token vocabularies, our recommendation models require new embeddings for new titles, necessitating expanded embedding layers and output components. To address this, we warm-start new models by reusing parameters from previous models and initializing new parameters for new titles. For example, new title embeddings can be initialized by adding slight random noise to existing average embeddings or by using a weighted combination of similar titles’ embeddings based on metadata. This approach allows new titles to start with relevant embeddings, facilitating faster fine-tuning. In practice, the initialization method becomes less critical when more member interaction data is used for fine-tuning.
Dealing with unseen entities : Even with incremental training, it’s not always guaranteed to learn efficiently on new entities (ex: newly launched titles). It’s also possible that there will be some new entities that are not included/seen in the training data even if we fine-tune foundation models on a frequent basis. Therefore, it’s also important to let foundation models use metadata information of entities and inputs, not just member interaction data. Thus, our foundation model combines both learnable item id embeddings and learnable embeddings from metadata. The following diagram demonstrates this idea.

**Figure 2.** Titles are associated with various metadata, such as genres, storylines, and tones. Each type of metadata could be represented by averaging its respective embeddings, which are then concatenated to form the overall metadata-based embedding for the title.

To create the final title embedding, we combine this metadata-based embedding with a fully-learnable ID-based embedding using a mixing layer. Instead of simply summing these embeddings, we use an attention mechanism based on the “age” of the entity. This approach allows new titles with limited interaction data to rely more on metadata, while established titles can depend more on ID-based embeddings. Since titles with similar metadata can have different user engagement, their embeddings should reflect these differences. Introducing some randomness during training encourages the model to learn from metadata rather than relying solely on ID embeddings. This method ensures that newly-launched or pre-launch titles have reasonable embeddings even with no user interaction data.

Downstream Applications and Challenges

Our recommendation foundation model is designed to understand long-term member preferences and can be utilized in various ways by downstream applications:

Direct Use as a Predictive Model The model is primarily trained to predict the next entity a user will interact with. It includes multiple predictor heads for different tasks, such as forecasting member preferences for various genres. These can be directly applied to meet diverse business needs..
Utilizing embeddings The model generates valuable embeddings for members and entities like videos, games, and genres. These embeddings are calculated in batch jobs and stored for use in both offline and online applications. They can serve as features in other models or be used for candidate generation, such as retrieving appealing titles for a user. High-quality title embeddings also support title-to-title recommendations. However, one important consideration is that the embedding space has arbitrary, uninterpretable dimensions and is incompatible across different model training runs. This poses challenges for downstream consumers, who must adapt to each retraining and redeployment, risking bugs due to invalidated assumptions about the embedding structure. To address this, we apply an orthogonal low-rank transformation to stabilize the user/item embedding space, ensuring consistent meaning of embedding dimensions, even as the base foundation model is retrained and redeployed.
Fine-Tuning with Specific Data The model’s adaptability allows for fine-tuning with application-specific data. Users can integrate the full model or subgraphs into their own models, fine-tuning them with less data and computational power. This approach achieves performance comparable to previous models, despite the initial foundation model requiring significant resources.

Scaling Foundation Models for Netflix Recommendations

In scaling up our foundation model for Netflix recommendations, we draw inspiration from the success of large language models (LLMs). Just as LLMs have demonstrated the power of scaling in improving performance, we find that scaling is crucial for enhancing generative recommendation tasks. Successful scaling demands robust evaluation, efficient training algorithms, and substantial computing resources. Evaluation must effectively differentiate model performance and identify areas for improvement. Scaling involves data, model, and context scaling, incorporating user engagement, external reviews, multimedia assets, and high-quality embeddings. Our experiments confirm that the scaling law also applies to our foundation model, with consistent improvements observed as we increase data and model size.

**Figure 3.** The relationship between model parameter size and relative performance improvement. The plot demonstrates the scaling law in recommendation modeling, showing a trend of increased performance with larger model sizes. The x-axis is logarithmically scaled to highlight growth across different magnitudes.

Conclusion

In conclusion, our Foundation Model for Personalized Recommendation represents a significant step towards creating a unified, data-centric system that leverages large-scale data to increase the quality of recommendations for our members. This approach borrows insights from Large Language Models (LLMs), particularly the principles of semi-supervised learning and end-to-end training, aiming to harness the vast scale of unlabeled user interaction data. Addressing unique challenges, like cold start and presentation bias, the model also acknowledges the distinct differences between language tasks and recommendation. The Foundation Model allows various downstream applications, from direct use as a predictive model to generate user and entity embeddings for other applications, and can be fine-tuned for specific canvases. We see promising results from downstream integrations. This move from multiple specialized models to a more comprehensive system marks an exciting development in the field of personalized recommendation systems.

Acknowledgements

Contributors to this work (name in alphabetical order): Ai-Lei Sun Aish Fenton Anne Cocos Anuj Shah Arash Aghevli Baolin Li Bowei Yan Dan Zheng Dawen Liang Ding Tong Divya Gadde Emma Kong Gary Yeh Inbar Naor Jin Wang Justin Basilico Kabir Nagrecha Kevin Zielnicki Linas Baltrunas Lingyi Liu Luke Wang Matan Appelbaum Michael Tu Moumita Bhattacharya Pablo Delgado Qiuling Xu Rakesh Komuravelli Raveesh Bhalla Rob Story Roger Menezes Sejoon Oh Shahrzad Naseri Swanand Joshi Trung Nguyen Vito Ostuni Wei Wang Zhe Zhang

Reference

C. K. Kang and J. McAuley, “Self-Attentive Sequential Recommendation,” 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 2018, pp. 197–206, doi: 10.1109/ICDM.2018.00035.
F. Sun et al., “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer,” Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM ‘19), Beijing, China, 2019, pp. 1441–1450, doi: 10.1145/3357384.3357895.
J. Zhai et al., “Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations,” arXiv preprint arXiv:2402.17152, 2024.
F. Gloeckle, B. Youbi Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, “Better & Faster Large Language Models via Multi-token Prediction,” arXiv preprint arXiv:2404.19737, Apr. 2024.

Foundation Model for Personalized Recommendation was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Globalizing Productions with Netflix’s Media Production Suite

March 31, 2025, 9:21 am

≫ Next: How Netflix Accurately Attributes eBPF Flow Logs

≪ Previous: Foundation Model for Personalized Recommendation

Jesse Korosi, Thijs van de Kamp, Mayra Vega, Laura Futuro, Anton Margoline

The journey from script to screen is full of challenges in the ever-evolving world of film and television. The industry has always innovated, and over the last decade, it started moving towards cloud-based workflows. However, unlocking cloud innovation and all its benefits on a global scale has proven to be difficult. The opportunity is clear: streamline complex media management logistics, eliminate tedious, non-creative task-based work and enable productions to focus on what matters most–creative storytelling. With these challenges in mind, Netflix has developed a suite of tools by filmmakers for filmmakers: the Media Production Suite (MPS).

What are we solving for?

Significant time and resources are devoted to managing media logistics throughout the production lifecycle. An average Netflix title produces around ~200 Terabytes of Original Camera Files (OCF), with outliers up to 700 Terabytes, not including any work-in-progress files, VFX assets, 3D assets, etc. The data produced on set is traditionally copied to physical tape stock like LTO. This workflow has been considered the industry norm for a long time and may be cost-effective, but comes with trade-offs. Aside from needing to physically ship and track all movement of the tape stock, storing media on a physical tape makes it harder to search, play and share media assets; slowing down accessibility to production media when needed, especially when titles need to collaborate with talent and vendors all over the world.

Even when workflows are fully digital, the distribution of media between multiple departments and vendors can still be challenging. A lack of automation and standardization often results in a labour-intensive process across post-production and VFX with a lot of dependencies that introduce potential human errors and security risks. Many productions utilize a large variety of vendors, making this collaboration a large technical puzzle. As file sizes grow and workflows become more complex, these issues are magnified, leading to inefficiencies that slow down post-production and reduce the available time spent on creative work.

Moving media into the cloud introduces new challenges for production and post ramping up to meet the operational and technological hurdles this poses. For some post-production facilities, it’s not uncommon to see a wall of portable hard drives at their facility, with media being hand-carried between vendors because alternatives are not available. The need for a centralized, cloud-based solution that transcends these barriers is more pressing than ever. This results in a willingness to embrace new and innovative ideas, even if exploratory, and introduce drastic workflow changes to productions in pursuit of creative evolution.

At Netflix, we believe that great stories can come from anywhere, but we have seen that technical limitations in traditional workflows reduce access to media and restrict filmmakers’ access to talent. Besides the need for robust cloud storage for their media, artists need access to powerful workstations and real-time playback. Depending on the market, or production budget, cutting-edge technology might not be available or affordable.

What if we started charting a course to break free from many of these technical limitations and found ways to enhance creativity? Industry trade shows like the International Broadcast Convention (IBC) and the National Association of Broadcasters Show (NAB) highlight a strong global trend: instead of bringing media to the artist/applications (traditional workflow) we see the shift towards bringing people and applications to the media (cloud workflows and remote workstations). The concept of cloud-based workflows is not new, as many technology leaders in our industry have been experimenting in this space for more than a decade. However, executing this vision at a Netflix scale with hundreds of titles a year has not been done before…

The challenge of building a global technology to solve this

Building solutions at a global scale poses significant challenges. The art of making movies and series lacks equal access to technology, best practices, and global standardization. Different countries worldwide are at different phases of innovation based on local needs and nuances. While some regions boast over a century of cinematic history and have a strong industry, others are just beginning to carve their niche. This vast gap presents a unique challenge: developing global technology that caters to both established and emerging markets, each with distinct languages and workflows.

The large diversity of needs by talent and vendors globally creates a standardization challenge and can be seen when productions use a global talent pool. Many mature post-production and VFX facilities have built scripts and automation that flow between various artists and personnel within their facility; allowing a more streamlined workflow, even though the customization is time-consuming. E.g., Transcoding, or transcriptions that automatically run when files are dropped in a hot folder, with the expectation that certain sidecar metadata files will accompany them with a specific organizational structure. Embracing and integrating new workflows introduces the fear of disrupting a well-established process, increasing additional pressure on the profit margins of vendors. Small workflow changes that may seem arbitrary may actually have a large impact on vendors. Therefore, innovation should provide meaningful benefits to a title in order to get adopted at scale. Reliability, a proven track record, strong support, and an incredibly low tolerance for bugs, or issues are top of mind in well-established markets.

In developing this suite, we recognized the necessity of addressing the vast array of titles that flow through Netflix without the luxury of expanding into a massive operational entity. Consequently, automation became imperative. The intricacies of color and framing management, along with deliverables, must be seamlessly controlled and effortlessly managed by the user, without the need for manual intervention. Therefore, we cannot lean into humans configuring JSON files behind the scenes to map camera formats into deliverables. By embracing open standards, we not only streamline these processes but also facilitate smoother collaboration across diverse markets and countries, ensuring that our global productions can operate with unparalleled efficiency and cohesion. To ensure this, we’ve decided to lean heavily into standards like ACES, AMF, ASC MHL, ASC FDL, and OTIO. ACES and AMF for color pipeline management. ASC MHL for any file management/verifications. ASC FDL will serve as our framing interoperability and OTIO for any timeline interchange. Leaning into standards like this means that many things can be automated at scale and more importantly, high-complexity workflows can be offered to markets or shows that don’t normally have access to them. As an example, if a show is shot on various camera formats all framed and recorded at different resolutions, with different lenses and different safeties on each frame. The task of normalizing all of these for a VFX vendor into one common container with a normalized center extracted frame is often only offered to very high-end titles, considering it takes a human behind the curtain to create all of these mappings. But by leaning into a standard like the FDL, it means this can now easily be automated, and the control for these mappings, put directly in the hands of users.

Our Answer — Content Hub’s Media Production Suite (MPS)

Building a global scalable solution that could be utilized in a diversity of markets has been an exciting challenge. We set out to provide customizable and feature-rich tooling for advanced users while remaining intuitive and streamlined enough for less experienced filmmakers. With collaboration from Netflix teams, vendors, and talent across the globe, we’ve taken a bold step forward in enabling a suite of tools inside Netflix Content Hub that democratizes technology: the Media Production Suite. While leveraging our scale economies and access to resources, we can now unlock global talent pools for our productions, drastically reduce non-creative task-based work, streamline workflows, and level the playing field between our markets, ultimately maximizing the time available for what matters most; creative work!

So what is it?

1. Netflix Hybrid Infrastructure: Netflix has invested in a hybrid infrastructure, a mix of cloud-based and physically distributed capabilities operating in multiple locations across the world and close to our productions to optimize user performance. This infrastructure is available for Netflix shows and is foundational under Content Hub’s Media Production Suite tooling. Local storage and compute services are connected through the Netflix Open Connect network (Netflix Content Delivery Network) to the infrastructure of Amazon Web Services (AWS). The system facilitates large volumes of camera and sound media and is built for speed. In order to ensure that productions have sufficient upload speeds to get their media into the cloud, Netflix has started to roll out Content Hub Ingest Centers globally to provide high-speed internet connectivity where required. With all media centralized, MPS eliminates the need for physical media transport and reduces the risk of human error. This approach not only streamlines operations but also enhances security and accessibility.

2. Automation and Tooling: In addition to the Netflix Hybrid infrastructure layer, MPS consists of a suite of tools that tap into the media in the Netflix ecosystem.

Footage Ingest — An application that allows users to upload media/files into Content Hub.

Media Library — A central library that allows users to search, preview, share and download media.

Dailies — A workflow, backed by an operational team, offering automated Quality Control of your footage, sound sync, application of color, rendering, and delivering dailies directly to editorial.

Remote Workstations — Offering access to remote editorial workstations and storage for post-production needs.

VFX Pulls — An automated method for converting and delivering visual effects plates, associated color, and framing files to VFX vendors.

Conform Pulls — An automated method for consolidating, trimming, and delivering all OCF to picture-finishing vendors.

Media Downloader — An automated download tool that initiates a download once media has been made available in the Netflix cloud.

While each of the individual tools within MPS is at different states of maturity, over 350 titles have made use of at least one of the tools noted above. Input has been taken from all over the world while developing, with users ranging from UCAN (United States/Canada), EMEA (Europe, Middle East, and Africa), SEA (South East Asia), LATAM (Latin America), and APAC (Asia Pacific).

Senna: Early Adoption and Insightful Feedback Driving MPS Evolution

*Media from the Brazilian-produced series ‘Senna’ being reviewed in MPS*

The Brazilian-produced series Senna, which follows the life of legendary Formula 1 driver Ayrton Senna, utilized MPS to reshape their content creation workflow, overcome geographical barriers, and unlock innovation to support world-class storytelling for a global audience. Senna is a groundbreaking series, not just for its storytelling but for its production journey across Argentina, Uruguay, Brazil, and the United Kingdom. With editorial teams spread across Porto Alegre and Spain, and VFX studios collaborating across locations in Brazil, Canada, the United States, and India, all orchestrated by our subsidiary Scanline VFX. The series exemplifies the global nature of modern filmmaking and was the perfect fit for Netflix’s new Content Hub Media Production Suite (MPS).

At the heart of Senna’s workflow orchestration is MPS. While each of the tools within MPS is based on an opt-in model, in order to use many of the downstream services, the first step is ensuring that the original camera files (OCF) and original sound files (OSF) are uploaded. “We knew we were going to shoot in different places,” said Post Supervisor Gabriel Queiroz,“to have all this material cloud-based, it’s definitely one of the most important things for us. It would be hard to bring all this media physically from Argentina or wherever to Brazil. It will take us a lot of time.” With Senna shooting across locations, allowing production the capability of uploading their OCF and OSF resulted in no longer requiring shuttling hard drives on airplanes, creating LTO tapes, & managing physical shipments for their negative. And yes, you read that correctly; when utilizing MPS, we don’t require LTO tapes to be written unless there are title-specific needs.

With Senna beginning production back in June of 2023, our investment in MPS was still very early stages, and the tooling was considered beta. However, with the help, feedback, and partnership from this production, it was quickly realized that the investment was worth doubling down on. Since the early version used on Senna, Netflix has been spinning up ingest centers around the world, where drives can be dropped off, and within a matter of hours, all original camera files are uploaded into the Netflix ecosystem. While creating the ability to upload is not a novel concept, behind the scenes, it’s far from simple. Once a drive has been plugged in and our Netflix Footage Ingest application is opened, a validation is run, ensuring all expected media from set is on the drive. After media has been uploaded and checksums are run validating media integrity, all media is inspected, metadata is extracted, and assets are created for viewing/sharing/downloading with playable proxies. All media is then automatically backed up to a second tier of cloud-based storage for the final archive.

Traditionally, if you wanted to check in with your post vendor on how things are going for each of these media management steps noted above, or whether or not you can clear on set camera cards if you haven’t gotten a completion notification, you would have to pick up the phone and call your vendor. For Senna, anyone who wanted visibility on progress, simply logged in to Content Hub and could see any activity in the Footage Ingest dashboard, as well as look up any information needed on past uploads.

*Remote monitoring media being uploaded and archived using the MPS Footage Ingest workflow*

While many services in MPS are available once media has been uploaded, Senna’s use of MPS focused on VFX. With Senna shooting a high volume of footage and the show having a high volume of VFX shots, according to Post Supervisor Gabriel Queiroz “Using MPS was basically a no-brainer, [having] used the tool before, I knew what it could bring to the project. And to be honest, with the amount of footage that we have, it was just so much material and with the amount of vendors we have, knowing that we would have to deliver all this footage to all these kinds of vendors, including outside of Brazil and to different parts of the world.”

With a traditional workflow, utilizing available resources in Latin America, VFX Pulls would have been done manually. This process is prone to human error and more importantly, for a show like Senna, too slow and would have resulted in different I/O methods for every vendor.

*Illustrating a traditional VFX Editor having to manage various I/O methods*

By utilizing MPS, the Assistant Editor was able to log into Content Hub, upload an EDL, and have their VFX Pulls automatically transcoded, color files consolidated and all media placed into a Google Drive style folder built directly in Content Hub (called Workspaces). The VFX Editor was able to make any additional tweaks they wanted to the directory before farming out each of the shots to whichever vendor they were meant for. When it came time for the VFX vendors to then send shots back to editorial or DI, this was also done through MPS. Having one standard method for I/O for all VFX file sharing meant that Editorial and DI did not have to manage a different file transfer/workflow for every single vendor that was onboarded.

*Illustrating a more streamlined workflow for VFX vendors when using MPS*

After picture was locked and it was time for Senna to do their Online, the DI facility Quanta was able to utilize the Conform Pull service within MPS. The Conform Pull service allowed their team to upload an EDL, which ran a QC on all of the media from within their edit to ensure a smooth conform and then consolidated, trimmed, and packaged up all of the media they needed for the online. Since this early beta and thanks to learnings from many shows like Senna, advancements have been made in the system’s ability to match back to source media for both Conform and VFX Pulls. Rather than requiring an exact match between EDL and source OCF, there are several variations of fuzzy matching that can take place, as well as a current investigation in using one of our perceptual matching algorithms, allowing for a perceptual conform using computer vision, instead of solely relying on metadata.

Conclusion

The Media Production Suite (MPS) represents a transformative leap in how we approach media production at Netflix. By embracing open standards, we have crafted a scalable solution that not only makes economic sense but also democratizes access to advanced production tools across the globe. This approach allows us to eliminate tedious tasks, enabling our teams to focus on what truly matters: creative storytelling. By fostering global collaboration and leveraging the power of cloud-based workflows, we’re not just enhancing efficiency but also elevating the quality of our productions. As we continue to innovate and refine our processes, we remain committed to breaking down barriers and unlocking the full potential of creative talent worldwide. The future of filmmaking is here, and with MPS, we are leading the charge toward a more connected and creatively empowered industry.

Globalizing Productions with Netflix’s Media Production Suite was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

How Netflix Accurately Attributes eBPF Flow Logs

April 8, 2025, 10:50 am

≫ Next: Measuring Dialogue Intelligibility for Netflix Content

≪ Previous: Globalizing Productions with Netflix’s Media Production Suite

By Cheng Xie, Bryan Shultz, and Christine Xu

In a previous blog post, we described how Netflix uses eBPF to capture TCP flow logs at scale for enhanced cloud network insights. In this post, we delve deeper into how Netflix solved a core problem: accurately attributing flow IP addresses to workload identities.

A Brief Recap

FlowExporter is a sidecar that runs alongside all Netflix workloads in the AWS Cloud. It uses eBPF and TCP tracepoints to monitor TCP socket state changes. When a TCP socket closes, FlowExporter generates a flow log record that includes the IP addresses, ports, timestamps, and additional socket statistics. On average, 5 million records are produced per second.

In cloud environments, IP addresses are reassigned to different workloads as workload instances are created and terminated, so IP addresses alone cannot provide insights on which workloads are communicating. To make the flow logs useful, each IP address must be attributed to its corresponding workload identity. FlowCollector, a backend service, collects flow logs from FlowExporter instances across the fleet, attributes the IP addresses, and sends these attributed flows to Netflix’s Data Mesh for subsequent stream and batch processing.

The eBPF flow logs provide a comprehensive view of service topology and network health across Netflix’s extensive microservices fleet, regardless of the programming language, RPC mechanism, or application-layer protocol used by individual workloads. This is especially useful for reaching the corners where our Service Mesh does not yet have coverage.

The Problem with Misattribution

Accurately attributing flow IP addresses to workload identities has been a significant challenge since our eBPF flow logs were introduced.

As noted in our previous blog post, our initial attribution approach relied on Sonar, an internal IP address tracking service that emits an event whenever an IP address in Netflix’s AWS VPCs is assigned or unassigned to a workload. FlowCollector consumes a stream of IP address change events from Sonar and uses this information to attribute flow IP addresses in real-time.

The fundamental drawback of this method is that it can lead to misattribution. Delays and failures are inevitable in distributed systems, which may delay IP address change events from reaching FlowCollector. For instance, an IP address may initially be assigned to workload X but later reassigned to workload Y. However, if the change event for this reassignment is delayed, FlowCollector will continue to assume that the IP address belongs to workload X, resulting in misattributed flows. Additionally, event timestamps may be inaccurate depending on how they are captured.

Misattribution rendered the flow data unreliable for decision-making. Users often depend on flow logs to validate workload dependencies, but misattribution creates confusion. Without expert knowledge of expected dependencies, users would struggle to identify or confirm misattribution. Moreover, misattribution occurred frequently for critical services with a large footprint due to frequent IP address changes. Overall, misattribution makes fleet-wide dependency analysis impractical.

As a workaround, we made FlowCollector hold received flows for 15 minutes before attribution, allowing time for delayed IP address change events. While this approach reduced misattribution, it did not eliminate it. Moreover, the waiting period made the data less fresh, reducing its utility for real-time analysis.

Fully eliminating misattribution is crucial because it only takes a single misattributed flow to produce an incorrect workload dependency. Solving this problem required a complete rethinking of our approach. Over the past year, Netflix developed a new attribution method that has finally eliminated misattribution, as detailed in the rest of this post.

Attributing Local IP Addresses

Each socket has two IP addresses: a local IP address and a remote IP address. Previously, we used the same method to attribute both. However, attributing the local IP address should be a simpler task since the local IP address belongs to the instance where FlowExporter captures the socket. Therefore, FlowExporter should determine the local workload identity from its environment and attribute the local IP address before sending the flow to FlowCollector.

This is straightforward for workloads running directly on EC2 instances, as Netflix’s Metatron provisions workload identity certificates to each EC2 instance at boot time. FlowExporter can simply read these certificates from the local disk to determine the local workload identity.

Attributing local IP addresses for container workloads running on Netflix’s container platform, Titus, is more challenging. FlowExporter runs at the container host level, where each host manages multiple container workloads with different identities. When FlowExporter’s eBPF programs receive a socket event from TCP tracepoints in the kernel, the socket may have been created by one of the container workloads or by the host itself. Therefore, FlowExporter must determine which workload to attribute the socket’s local IP address to. To solve this problem, we leveraged IPMan, Netflix’s container IP address assignment service. IPManAgent, a daemon running on every container host, is responsible for assigning and unassigning IP addresses. As container workloads are launched, IPManAgent writes an IP-address-to-workload-ID mapping to an eBPF map, which FlowExporter’s eBPF programs can then use to look up the workload ID associated with a socket local IP address.

Another challenge was to accommodate Netflix’s IPv6 to IPv4 translation mechanism on Titus. To facilitate IPv6 migration, Netflix developed a mechanism that enables IPv6-only containers to communicate with IPv4 destinations without incurring NAT64 overhead. This mechanism intercepts connect syscalls and replaces the underlying socket with one that uses a shared IPv4 address assigned to the container host. This confuses FlowExporter because the kernel reports the same local IPv4 address for sockets created by different container workloads. To disambiguate, local port information is additionally required. We modified Titus to write a mapping of (local IPv4 address, local port) to the workload ID into an eBPF map whenever a connect syscall is intercepted. FlowExporter’s eBPF programs then use this map to correctly attribute sockets created by the translation mechanism.

With these problems solved, we can now accurately attribute the local IP address of every flow.

Attributing Remote IP Addresses

Once the local IP address attribution problem is solved, accurately attributing remote IP addresses becomes feasible. Now, each flow reported by FlowExporter includes the local IP address, the local workload identity, and connection start/end timestamps. As FlowCollector receives these flows, it can learn the time ranges during which each workload owns a given IP address. For instance, if FlowCollector sees a flow with local IP address 10.0.0.1 associated with workload X that starts at t1 and ends at t2, it can deduce that 10.0.0.1 belonged to workload X from t1 to t2. Since Netflix uses Amazon Time Sync across its fleet, the timestamps (captured by FlowExporter) are reliable.

The FlowCollector service cluster consists of many nodes. Every node must be capable of attributing arbitrary remote IP addresses and, therefore, requires knowledge of all workload IP addresses and their recent ownership records. To represent this knowledge, each node maintains an in-memory hashmap that maps an IP address to a list of time ranges, as illustrated by the following Go structs:

type IPAddressTracker struct {
    ipToTimeRanges map[netip.Addr]timeRanges
}

type timeRanges []timeRange

type timeRange struct {
    workloadID   string
    start        time.Time
    end          time.Time
}

To populate the hashmap, FlowCollector extracts the local IP address, local workload identity, start time, and end time from each received flow and creates/extends the corresponding time ranges in the map. The time ranges for each IP address are sorted in ascending order, and they are non-overlapping since an IP address cannot belong to two different workloads simultaneously.

Since each flow is only sent to one FlowCollector node, each node must share the time ranges it learned from received flows with other nodes. We implemented a broadcasting mechanism using Kafka, where each node publishes learned time ranges to all other nodes. Although more efficient broadcasting implementations exist, the Kafka-based approach is simple and has worked well for us.

Now, FlowCollector can attribute remote IP addresses by looking them up in the populated map, which returns a list of time ranges. It then uses the flow’s start timestamp to determine the corresponding time range and associated workload identity. If the start time does not fall within any time range, FlowCollector will retry after a delay, eventually giving up if the retry fails. Such failures may occur when flows are lost or broadcast messages are delayed. For our use cases, it is acceptable to leave a small percentage of flows unattributed, but any misattribution is unacceptable.

This new method achieves accurate attribution thanks to the continuous heartbeats, each associated with a reliable time range of IP address ownership. It handles transient issues gracefully — a few delayed or lost heartbeats do not lead to misattribution. In contrast, the previous method relied solely on discrete IP address assignment and unassignment events. Lacking heartbeats, it had to presume an IP address remained assigned until notified otherwise (which can be hours or days later), making it vulnerable to misattribution when the notifications were delayed.

One detail is that when FlowCollector receives a flow, it cannot attribute its remote IP address right away because it requires the latest observed time ranges for the remote IP address. Since FlowExporter reports flows in batches every minute, FlowCollector must wait until it receives the flow batch from the remote workload FlowExporter for the last minute, which may not have arrived yet. To address this, FlowCollector temporarily stores received flows on disk for one minute before attributing their remote IP addresses. This introduces a 1-minute delay, but it is much shorter than the 15-minute delay with the previous approach.

In addition to producing accurate attribution, the new method is also cost-effective thanks to its simplicity and in-memory lookups. Because the in-memory state can be quickly rebuilt when a FlowCollector node starts up, no persistent storage is required. With 30 c7i.2xlarge instances, we can process 5 million flows per second for the entire Netflix fleet.

Attributing Cross-Regional IP Addresses

For simplicity, we have so far glossed over one topic: regionalization. Netflix’s cloud microservices operate across multiple AWS regions. To optimize flow reporting and minimize cross-regional traffic, a FlowCollector cluster runs in each major region, and FlowExporter agents send flows to their corresponding regional FlowCollector. When FlowCollector receives a flow, its local IP address is guaranteed to be within the region.

To minimize cross-region traffic, the broadcasting mechanism is limited to FlowCollector nodes within the same region. Consequently, the IP address time ranges map contains only IP addresses from that region. However, cross-regional flows have a remote IP address in a different region. To attribute these flows, the receiving FlowCollector node forwards them to nodes in the corresponding region. FlowCollector determines the region for a remote IP address by looking up a trie built from all Netflix VPC CIDRs. This approach is more efficient than broadcasting IP address time range updates across all regions, as only 1% of Netflix flows are cross-regional.

Attributing Non-Workload IP Addresses

So far, FlowCollector can accurately attribute IP addresses belonging to Netflix’s cloud workloads. However, not all flow IP addresses fall into this category. For instance, a significant portion of flows goes through AWS ELBs. For these flows, their remote IP addresses are associated with the ELBs, where we cannot run FlowExporter. Consequently, FlowCollector cannot determine their identities by simply observing the received flows. To attribute these remote IP addresses, we continue to use IP address change events from Sonar, which crawls AWS resources to detect changes in IP address assignments. Although this data stream may contain inaccurate timestamps and be delayed, misattribution is not a main concern since ELB IP address reassignment occurs very infrequently.

Verifying Correctness

Verifying that the new method has eliminated misattribution is challenging due to the lack of a definitive source of truth for workload dependencies to validate flow logs against; the flow logs themselves are intended to serve as this source of truth, after all. To build confidence, we analyzed the flow logs of a large service with well-understood dependencies. A large footprint is necessary, as misattribution is more prevalent in services with numerous instances, and there must be a reliable method to determine the dependencies for this service without relying on flow logs.

Netflix’s cloud gateway, Zuul, served this purpose perfectly due to its extensive footprint (handling all cloud ingress traffic), its large number of downstream dependencies, and our ability to derive its dependencies from its routing configurations as the source of truth for comparison with flow logs. We found no misattribution for flows through Zuul over a two-week window. This provided strong confidence that the new attribution method has eliminated misattribution. In the previous approach, approximately 40% of Zuul’s dependencies reported by the flow logs were misattributed.

Conclusion

With misattribution solved, eBPF flow logs now deliver dependable, fleet-wide insights into Netflix’s service topology and network health. This advancement unlocks numerous exciting opportunities in areas such as service dependency auditing, security analysis, and incident triage, while helping Netflix engineers develop a better understanding of our ever-evolving distributed systems.

Acknowledgments

We would like to thank Martin Dubcovsky, Joanne Koong, Taras Roshko, Nabil Schear, Jacob Meyers, Parsha Pourkhomami, Hechao Li, Donavan Fritz, Rob Gulewich, Amanda Li, John Salem, Hariharan Ananthakrishnan, Keerti Lakshminarayan, and other stunning colleagues for their feedback, inspiration, and contributions to the success of this effort.

How Netflix Accurately Attributes eBPF Flow Logs was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Measuring Dialogue Intelligibility for Netflix Content

May 7, 2025, 5:40 pm

≫ Next: Behind the Scenes: Building a Robust Ads Event Processing Pipeline

≪ Previous: How Netflix Accurately Attributes eBPF Flow Logs

Enhancing Member Experience Through Strategic Collaboration

Ozzie Sutherland, Iroro Orife, Chih-Wei Wu, Bhanu Srikanth

At Netflix, delivering the best possible experience for our members is at the heart of everything we do, and we know we can’t do it alone. That’s why we work closely with a diverse ecosystem of technology partners, combining their deep expertise with our creative and operational insights. Together, we explore new ideas, develop practical tools, and push technical boundaries in service of storytelling. This collaboration not only empowers the talented creatives working on our shows with better tools to bring their vision to life, but also helps us innovate in service of our members. By building these partnerships on trust, transparency, and shared purpose, we’re able to move faster and more meaningfully, always with the goal of making our stories more immersive, accessible, and enjoyable for audiences everywhere. One area where this collaboration is making a meaningful impact is in improving dialogue intelligibility, from set to screen. We call this the Dialogue Integrity Pipeline.

Dialogue Integrity Pipeline

We’ve all been there, settling in for a night of entertainment, only to find ourselves straining to catch what was just said on screen. You’re wrapped up in the story, totally invested, when suddenly a key line of dialogue vanishes into thin air. “Wait, what did they say? I can’t understand the dialogue! What just happened?”

You may pick up the remote and rewind, turn up the volume, or try to stay with it and hope this doesn’t happen again. Creating sophisticated, modern series and films requires an incredible artistic & technical effort. At Netflix, we strive to ensure those great stories are easy for the audience to enjoy. Dialogue intelligibility can break down at multiple points in what we call the Dialogue Integrity Pipeline, the journey from on-set capture to final playback at home. Many facets of the process can contribute to dialogue that’s difficult to understand:

Naturalistic acting styles, diverse speech patterns, and accents
Noisy locations, microphone placement problems on set
Cinematic (high dynamic range) mixing styles, excessive dialogue processing, substandard equipment
Audio compromises through the distribution pipeline
TVs with inadequate speakers, noisy home environments

Addressing these issues is critical to maintaining the standard of excellence our content deserves.

Measurement at Scale

Netflix utilizes industry-standard loudness meters to measure content and its adherence to our core loudness specifications. This tool also provides feedback on audio dynamic range (loud to soft) which impacts dialogue intelligibility. The Audio Algorithms team at Netflix wanted to take these measurements further and develop a holistic understanding of dialogue intelligibility throughout the runtime of a given title.

The team developed a Speech Intelligibility measurement system based on the Short-time Objective Intelligibility (STOI) metric [Taal et al. (IEEE Transactions on Audio, Speech, and Language Processing)]. Firstly, a speech activity detector analyses the dialogue stem to render speech utterances, which are then compared to non-speech sounds in the mix, typically Music and Effects. Then the system calculates the Signal-to-Noise ratio, in each speech frequency band, the results of which are summarized succinctly, per-utterance on the range [0, 1.0], to quantify the degree to which competing Music and Effects can distract the listener.

This chart shows how eSTOI (extended Short-Time Objective Intelligibility) method measures dialogue (fg [foreground] stem in the graphic) against non-speech (bg [background] stem in the graphic) to judge intelligibility based on competing non-speech sound.

Optimizing Dialogue Prior to Delivery

Understanding dialogue intelligibility across Netflix titles is invaluable, but our mission goes beyond analysis — we strive to empower creators with the tools to craft mixes that resonate seamlessly with audiences at home.

Seeing the lack of dedicated Dialogue Intelligibility Meter plugins for Digital Audio Workstations, we teamed up with industry leaders, Fraunhofer Institute for Digital Media Technology IDMT (Fraunhofer IDMT) and Nugen Audio to pioneer a solution that enhances creative control and ensures crystal-clear dialogue from mix to final delivery.

We collaborated with Fraunhofer IDMT to adapt their machine-learning-based speech intelligibility solution for cross-platform plugin standards and brought in Nugen Audio to develop DAW-compatible plugins.

Fraunhofer IDMT

The Fraunhofer Department of Hearing, Speech, and Audio Technology HSA has done significant research and development on media processing tools that measure speech intelligibility. In 2020, the machine learning-based method was integrated into Steinberg’s Nuendo Digital Audio Workstation. We approached the Fraunhofer engineering team with a collaboration proposal to make their technology accessible to other audio workstations through the cross-platform VST (Virtual Studio Technology) and AAX (Avid Audio Extension) plugin standards. The scientists were keen on the project and provided their dialogue intelligibility library.

Nugen Audio

Nugen Audio created the VisLM plugin to provide sound teams with an efficient and accurate way to measure mixes for conformance to traditional broadcast & streaming specifications — Full Mix Loudness, Dialogue Loudness, and True Peak. Since then, VisLM has become a widely used tool throughout the global post-production industry. Nugen Audio partnered with Fraunhofer, integrating the Fraunhofer IDMT Dialogue Intelligibility libraries into a new industry-first tool — Nugen DialogCheck. This tool gives re-recording mixers real-time insights, helping them adjust dialogue clarity at the most crucial points in the mixing process, ensuring every word is clear and understood.

Clearer Dialogue Through Collaboration

Crafting crystal-clear dialogue isn’t just a technical challenge — it’s an art that requires continuous innovation and strong industry collaboration. To empower creators, Netflix and its partners are embedding advanced intelligibility measurement tools directly into DAWs, giving sound teams the ability to:

Detect and resolve dialogue clarity issues early in the mix.
Fine-tune speech intelligibility without compromising artistic intent.
Deliver immersive, accessible storytelling to every viewer, in any listening environment.

At Netflix, we’re committed to pushing the boundaries of audio excellence, from innovating scaled intelligibility measurements to collaborating with Fraunhofer and Nugen Audio on cutting-edge tools like the DialogCheck Plugin, we’re setting a new standard for dialogue clarity — ensuring every word is heard exactly as creators intended. But innovation doesn’t happen in isolation. By working together with our partners, we can continue to push the limits of what’s possible, fueling creativity and driving the future of storytelling.

Finally, we’d like to extend a heartfelt thanks to Scott Kramer for his contributions to this initiative.

Measuring Dialogue Intelligibility for Netflix Content was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Behind the Scenes: Building a Robust Ads Event Processing Pipeline

May 9, 2025, 12:44 pm

≫ Next: FM-Intent: Predicting User Session Intent with Hierarchical Multi-Task Learning

≪ Previous: Measuring Dialogue Intelligibility for Netflix Content

Kinesh Satiya

Introduction

In a digital advertising platform, a robust feedback system is essential for the lifecycle and success of an ad campaign. This system comprises of diverse sub-systems designed to monitor, measure, and optimize ad campaigns. At Netflix, we embarked on a journey to build a robust event processing platform that not only meets the current demands but also scales for future needs. This blog post delves into the architectural evolution and technical decisions that underpin our Ads event processing pipeline.

Ad serving acts like the “brain” — making decisions, optimizing delivery and ensuring right Ad is shown to the right member at the right time. Meanwhile, ad events, after an Ad is rendered, function like “heartbeats”, continuously providing real-time feedback (oxygen/nutrients) that fuels better decision-making, optimizations, reporting, measurement, and billing. Expanding on this analogy:

Just as the brain relies on continuous blood flow, ad serving depends on a steady stream of ad events to adjust next ad serving decision, frequency capping, pacing, and personalization.
If the nervous system stops sending signals (ad events stop flowing), the brain (ad serving) lacks critical insights and starts making poor decisions or even fails.
The healthier and more accurate the event stream (just like strong heart function), the better the ad serving system can adapt, optimize, and drive business outcomes.

Let’s dive into the journey of building this pipeline.

The Pilot

In November 2022, we launched a brand new basic ads plan, in partnership with Microsoft. The software systems extended the existing Netflix playback systems to play ads. Initially, the system was designed to be simple, secure, and efficient, with an underlying ethos of device-originated and server-proxied operations. The system consisted of three main components: the Microsoft Ad Server, Netflix Ads Manager, and Ad Event Handler. Each ad served required tracking to ensure the feedback loop functioned effectively, providing the external ad server with insights on impressions, frequency capping (advertiser policy that limits the number of times a user sees a specific ad), and monetization processes.

Key features of this system include:

Client Request: Client devices request for ads during an ad break from Netflix playback systems, which is then decorated with information by ads manager to request ads from the ad server.
Server-Side Ad Insertion: The Ad Server sends ad responses using the VAST (Video Ad Serving Template) format.
Netflix Ads Manager: This service parses VAST documents, extracts tracking event information, and creates a simplified response structure for Netflix playback systems and client devices.
— The tracking information is packed into a structured protobuf data model.
— This structure is encrypted to create an opaque token.
— The final response, informs the client devices, when to send an event and the corresponding token.
Client Device: During ad playback, client devices send events accompanied by a token. The Netflix telemetry system then enqueues all these events in Kafka for asynchronous processing.
Ads Event Handler: This component is a Kafka consumer, that reads/decrypts the event payload and forwards the tracking information encoded back to the ad server and other vendors.

There is an excellent prior blog post that explains how this systems was tested end-to-end at scale. This system design allowed us to quickly add new integrations for verification with vendors like DV, IAS and Nielsen for measurement.

The Expansion

As we continued to expand our third-party (3P) advertising vendors for measurement, tracking and verification, we identified a critical trend: growth in the volume of data encapsulated within opaque tokens. These tokens, which are cached on client devices, present a risk of elevated memory usage, potentially impacting device performance. We also anticipated increase in third-party tracking URLs, metadata needs, and more event types as our business added new capabilities.

To strategically address these challenges, we introduced a new persistence layer using Key-Value abstraction, between ad serving and event handling system: Ads Metadata Registry. This transient storage service stores metadata for each Ad served, and upon callback, event handler would read the tracking information to relay information to the vendors. The contract between the client device and Ads systems continues to use the opaque token per event, but now, instead of tracking information, it contains reference identifiers — Ad ID, the corresponding metadata record ID in the registry and the event name. This approach future proofed our systems to handle any growth in data that needs to pass from ad serving to event handling systems.

Fig 2: Storage service between Ad Serving & Reporting

The Evolution

In January of 2024, we decided to invest in in-house advertising technology platform. This implied that the event processing pipeline had to evolve significantly — attain parity with existing offerings and continue to support new product launches with rapid iterations using in-house Netflix Ad Server. This required re-evaluation of the entire architecture across all of Ads engineering teams.

First, we made an inventory of the use-cases that would need to be supported through ad events.

We’d need to start supporting frequency capping in-house for all ads through Netflix Ad server.
Incorporate pricing information for impressions to set the stage for billing events, which are used to charge advertisers.
A robust reporting system to share campaign reports with advertisers, combined with metrics data collection, helps assess the delivery and effectiveness of the campaign.
Scale event handler to perform tracking information look-ups across different vendors.

Next, we examined upcoming launches, such as Pause/Display ads, to gain deeper insights into our strategic initiatives. We recognized that Display Ads would utilize a distinct logging framework, suggesting that different upstream pipelines might deliver ad telemetry. However, the downstream use-cases were expected to remain largely consistent. Additionally, by reviewing the goals of our telemetry teams, we saw large initiatives aimed at upgrading the platform, indicating potential future migrations.

Keeping the above insights & challenges in mind,

We planned a centralized ad event collection system. This centralized service would consolidate common operations like decryption of tokens, enrichment, hashing identifiers into a single step execution and provide a single unified data contract to consumers that is highly extensible (like being agnostic to ad server & ad media).
We proposed moving all consumers of ad telemetry downstream of the centralized service. This creates a clean separation between upstream systems and consumers in Ads Engineering.
In the initial development phase of our advertising system, a crucial component was the creation of ad sessions based on individual ad events. This system was constructed using ad playback telemetry, which allowed us to gather essential metrics from these ad sessions. A significant decision in this plan was to position the ad sessionization process downstream of the raw ad events.
The proposal also recommended moving all our Ads data processing pipelines for reporting/analytics/metrics for Ads using the data published by the centralized system.

Putting together all the components in our vision -

Key components on event processing pipeline -

Ads Event Publisher: This centralized system is responsible for collecting ads telemetry and providing unified ad events to the ads engineering teams. It supports various functions such as measurement, finance/billing, reporting, frequency capping, and maintaining an essential feedback loop back to the ad server.

Realtime Consumers

Frequency Capping: This system tracks impressions for each campaign, profile, and any other frequency capping parameters set up for the campaign. It is utilized by the Ad Server during each ad decision to ensure ads are served with frequency limits.
Ads Metrics: This component is a Flink job that transforms raw data to a set of dimensions and metrics, subsequently writing to Apache Druid OLAP database. The streaming data is further backed by an offline process that corrects any inaccuracy during streaming ingestion and providing accurate metrics. It provides real-time metrics to assess the delivery health of campaigns and applies budget capping functionality.
Ads Sessionizer: An Apache Flink job that consolidates all events related to a single ad into an Ad Session. This session provides real-time information about ad playback, offering essential business insights and reporting. It is a crucial job that supports all downstream analytical and reporting processes.
Ads Event Handler: This service continuously sends information to ad vendors by reading tracking information from ad events, ensuring accurate and timely data exchange.

Billing/Revenue: These are offline workflows designed to curate impressions, supporting billing and revenue recognition processes.

Ads Reporting & Metrics: This service powers reporting module for our account managers and provides a centralized metrics API that help assess the delivery of a campaign.

This was a massive multi-quarter effort across different engineering teams. With extensive planning (kudos to our TPM team!) and coordination, we were able to iterate fast, build several services and execute the vision above, to power our in-house ads technology platform.

Conclusion

These systems have significantly accelerated our ability to launch new capabilities for the business.

Through our partnership with Microsoft, Display Ad events were integrated into the new pipeline for reusability and ensuring when launching through Netflix ads systems, all use-cases were covered.
Programmatic buying capabilities now support the exchange of numerous trackers and dynamic bid prices on impression events.
Sharing opt-out signals helps ensure privacy and compliance with GDPR regulations for Ads business in Europe, supporting accurate reporting and measurement.
New event types like Ad clicks and scanning of QR codes events also flow through the pipeline, ensuring all metrics and reporting are tracked consistently.

Key Takeways

Strategic, incremental evolution: The development of our ads event processing systems has been a carefully orchestrated journey. Each iteration was meticulously planned by addressing existing challenges, anticipating future needs, and showcasing teamwork, planning, and coordination across various teams. These pillars have been fundamental to the success of this journey.
Data contract: A clear data contract has been pivotal in ensuring consistency in interpretation and interoperability across our systems. By standardizing the data models, and establishing a clear data exchange between ad serving, and centralized event collection, our teams have been able to iterate at exceptional speed and continue to deliver many launches on time.
Separation of concerns: Consumers are relieved from the need to understand each source of ad telemetry or manage updates and migrations. Instead, a centralized system handles these tasks, allowing consumers to focus on their core business logic.

We have an exciting list of projects on the horizon. These include managing ad events from ads on Netflix live streams, de-duplication processes, and enriching data signals to deliver enhanced reporting and insights. Additionally, we are advancing our Native Ads strategy, integrating Conversion API for improved conversion tracking, among many others.

This is definitely not a season finale; it’s just the beginning of our journey to create a best-in-class ads technology platform. We warmly invite you to share your thoughts and comments with us. If you’re interested in learning more or becoming a part of this innovative journey, Ads Engineering is hiring!

Acknowledgements

A special thanks to our amazing colleagues and teams who helped build our foundational post-impression system: Simon Spencer, Priyankaa Vijayakumar, Indrajit Roy Choudhury; Ads TPM team — Sonya Bellamy; the Ad Serving Team — Andrew Sweeney, Tim Zheng, Haidong Tang and Ed Barker; the Ads Data Engineering Team — Sonali Sharma, Harsha Arepalli, and Wini Tran; Product Data Systems — David Klosowski; and the entire Ads Reporting and Measurement team!

Behind the Scenes: Building a Robust Ads Event Processing Pipeline was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

FM-Intent: Predicting User Session Intent with Hierarchical Multi-Task Learning

May 21, 2025, 9:28 am

≫ Next: Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix

≪ Previous: Behind the Scenes: Building a Robust Ads Event Processing Pipeline

Authors: Sejoon Oh, Moumita Bhattacharya, Yesu Feng, Sudarshan Lamkhede, Ko-Jen Hsiao, and Justin Basilico

Motivation

Recommender systems have become essential components of digital services across e-commerce, streaming media, and social networks [1, 2]. At Netflix, these systems drive significant product and business impact by connecting members with relevant content at the right time [3, 4]. While our recommendation foundation model (FM) has made substantial progress in understanding user preferences through large-scale learning from interaction histories (please refer to this article about FM @ Netflix), there is an opportunity to further enhance its capabilities. By extending FM to incorporate the prediction of underlying user intents, we aim to enrich its understanding of user sessions beyond next-item prediction, thereby offering a more comprehensive and nuanced recommendation experience.

Recent research has highlighted the importance of understanding user intent in online platforms [5, 6, 7, 8]. As Xia et al. [8] demonstrated at Pinterest, predicting a user’s future intent can lead to more accurate and personalized recommendations. However, existing intent prediction approaches typically employ simple multi-task learning that adds intent prediction heads to next-item prediction models without establishing a hierarchical relationship between these tasks.

To address these limitations, we introduce FM-Intent, a novel recommendation model that enhances our foundation model through hierarchical multi-task learning. FM-Intent captures a user’s latent session intent using both short-term and long-term implicit signals as proxies, then leverages this intent prediction to improve next-item recommendations. Unlike conventional approaches, FM-Intent establishes a clear hierarchy where intent predictions directly inform item recommendations, creating a more coherent and effective recommendation pipeline.

FM-Intent makes three key contributions:

A novel recommendation model that captures user intent on the Netflix platform and enhances next-item prediction using this intent information.
A hierarchical multi-task learning approach that effectively models both short-term and long-term user interests.
Comprehensive experimental validation showing significant performance improvements over state-of-the-art models, including our foundation model.

Understanding User Intent in Netflix

In the Netflix ecosystem, user intent manifests through various interaction metadata, as illustrated in Figure 1. FM-Intent leverages these implicit signals to predict both user intent and next-item recommendations.

Figure 1: Overview of user engagement data in Netflix. User intent can be associated with several interaction metadata. We leverage various implicit signals to predict user intent and next-item.

In Netflix, there can be multiple types of user intents. For instance,

Action Type: Categories reflecting what users intend to do on Netflix, such as discovering new content versus continuing previously started content. For example, when a member plays a follow-up episode of something they were already watching, this can be categorized as “continue watching” intent.

Genre Preference: The pre-defined genre labels (e.g., Action, Thriller, Comedy) that indicate a user’s content preferences during a session. These preferences can shift significantly between sessions, even for the same user.

Movie/Show Type: Whether a user is looking for a movie (typically a single, longer viewing experience) or a TV show (potentially multiple episodes of shorter duration).

Time-since-release: Whether the user prefers newly released content, recent content (e.g., between a week and a month), or evergreen catalog titles.

These dimensions serve as proxies for the latent user intent, which is often not directly observable but crucial for providing relevant recommendations.

FM-Intent Model Architecture

FM-Intent employs a hierarchical multi-task learning approach with three major components, as illustrated in Figure 2.

Figure 2: An architectural illustration of our hierarchical multi-task learning model FM-Intent for user intent and item predictions. We use ground-truth intent and item-ID labels to optimize predictions.

1. Input Feature Sequence Formation

The first component constructs rich input features by combining interaction metadata. The input feature for each interaction combines categorical embeddings and numerical features, creating a comprehensive representation of user behavior.

2. User Intent Prediction

The intent prediction component processes the input feature sequence through a Transformer encoder and generates predictions for multiple intent signals.

The Transformer encoder effectively models the long-term interest of users through multi-head attention mechanisms. For each prediction task, the intent encoding is transformed into prediction scores via fully-connected layers.

A key innovation in FM-Intent is the attention-based aggregation of individual intent predictions. This approach generates a comprehensive intent embedding that captures the relative importance of different intent signals for each user, providing valuable insights for personalization and explanation.

3. Next-Item Prediction with Hierarchical Multi-Task Learning

The final component combines the input features with the user intent embedding to make more accurate next-item recommendations.

FM-Intent employs hierarchical multi-task learning where intent predictions are conducted first, and their results are used as input features for the next-item prediction task. This hierarchical relationship ensures that the next-item recommendations are informed by the predicted user intent, creating a more coherent and effective recommendation model.

Offline Results

We conducted comprehensive offline experiments on sampled Netflix user engagement data to evaluate FM-Intent’s performance. Note that FM-Intent uses a much smaller dataset for training compared to the FM production model due to its complex hierarchical prediction architecture.

Next-Item and Next-Intent Prediction Accuracy

Table 1 compares FM-Intent with several state-of-the-art sequential recommendation models, including our production model (FM-Intent-V0).

Table 1: Next-item and next-intent prediction results of baselines and our proposed method FM-Intent on the Netflix user engagement dataset.

All metrics are represented as relative % improvements compared to the SOTA baseline: TransAct. N/A indicates that a model is not capable of predicting a certain intent. Note that we added additional fully-connected layers to LSTM, GRU, and Transformer baselines in order to predict user intent, while we used original implementations for other baselines. FM-Intent demonstrates statistically significant improvement of 7.4% in next-item prediction accuracy compared to the best baseline (TransAct).

Most baseline models show limited performance as they either cannot predict user intent or cannot incorporate intent predictions into next-item recommendations. Our production model (FM-Intent-V0) performs well but lacks the ability to predict and leverage user intent. Note that FM-Intent-V0 is trained with a smaller dataset for a fair comparison with other models; the actual production model is trained with a much larger dataset.

Qualitative Analysis: User Clustering

Figure 3: K-means++ (K=10) clustering of user intent embeddings found by FM-Intent; FM-Intent finds unique clusters of users that share the similar intent.

FM-Intent generates meaningful user intent embeddings that can be used for clustering users with similar intents. Figure 3 visualizes 10 distinct clusters identified through K-means++ clustering. These clusters reveal meaningful user segments with distinct viewing patterns:

Users who primarily discover new content versus those who continue watching recent/favorite content.
Genre enthusiasts (e.g., anime/kids content viewers).
Users with specific viewing patterns (e.g., Rewatchers versus casual viewers).

Potential Applications of FM-Intent

FM-Intent has been successfully integrated into Netflix’s recommendation ecosystem, can be leveraged for several downstream applications:

Personalized UI Optimization: The predicted user intent could inform the layout and content selection on the Netflix homepage, emphasizing different rows based on whether users are in discovery mode, continue-watching mode, or exploring specific genres.

Analytics and User Understanding: Intent embeddings and clusters provide valuable insights into viewing patterns and preferences, informing content acquisition and production decisions.

Enhanced Recommendation Signals: Intent predictions serve as features for other recommendation models, improving their accuracy and relevance.

Search Optimization: Real-time intent predictions help prioritize search results based on the user’s current session intent.

Conclusion

FM-Intent represents an advancement in Netflix’s recommendation capabilities by enhancing them with hierarchical multi-task learning for user intent prediction. Our comprehensive experiments demonstrate that FM-Intent significantly outperforms state-of-the-art models, including our prior foundation model that focused solely on next-item prediction. By understanding not just what users might watch next but what underlying intents users have, we can provide more personalized, relevant, and satisfying recommendations.

Acknowledgements

We thank our stunning colleagues in the Foundation Model team & AIMS org. for their valuable feedback and discussions. We also thank our partner teams for getting this up and running in production.

References

[1] Amatriain, X., & Basilico, J. (2015). Recommender systems in industry: A netflix case study. In Recommender systems handbook (pp. 385–419). Springer.

[2] Gomez-Uribe, C. A., & Hunt, N. (2015). The netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS), 6(4), 1–19.

[3] Jannach, D., & Jugovac, M. (2019). Measuring the business value of recommender systems. ACM Transactions on Management Information Systems (TMIS), 10(4), 1–23.

[4] Bhattacharya, M., & Lamkhede, S. (2022). Augmenting Netflix Search with In-Session Adapted Recommendations. In Proceedings of the 16th ACM Conference on Recommender Systems (pp. 542–545).

[5] Chen, Y., Liu, Z., Li, J., McAuley, J., & Xiong, C. (2022). Intent contrastive learning for sequential recommendation. In Proceedings of the ACM Web Conference 2022 (pp. 2172–2182).

[6] Ding, Y., Ma, Y., Wong, W. K., & Chua, T. S. (2021). Modeling instant user intent and content-level transition for sequential fashion recommendation. IEEE Transactions on Multimedia, 24, 2687–2700.

[7] Liu, Z., Chen, H., Sun, F., Xie, X., Gao, J., Ding, B., & Shen, Y. (2021). Intent preference decoupling for user representation on online recommender system. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence (pp. 2575–2582).

[8] Xia, X., Eksombatchai, P., Pancha, N., Badani, D. D., Wang, P. W., Gu, N., Joshi, S. V., Farahpour, N., Zhang, Z., & Zhai, A. (2023). TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 5249–5259).

FM-Intent: Predicting User Session Intent with Hierarchical Multi-Task Learning was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix

June 12, 2025, 7:56 am

≫ Next: AV1 @ Scale: Film Grain Synthesis, The Awakening

≪ Previous: FM-Intent: Predicting User Session Intent with Hierarchical Multi-Task Learning

By Alex Hutter, Alexandre Bertails, Claire Wang, Haoyuan He, Kishore Banala, Peter Royal, Shervin Afshar

As Netflix’s offerings grow — across films, series, games, live events, and ads — so does the complexity of the systems that support it. Core business concepts like ‘actor’ or ‘movie’ are modeled in many places: in our Enterprise GraphQL Gateway powering internal apps, in our asset management platform storing media assets, in our media computing platform that powers encoding pipelines, to name a few. Each system models these concepts differently and in isolation, with little coordination or shared understanding. While they often operate on the same concepts, these systems remain largely unaware of that fact, and of each other.

Spider-Man Pointing meme with each Spider-Man labelled as: “it’s a movie”, “it’s a tv show”, “it’s a game”.

As a result, several challenges emerge:

Duplicated and Inconsistent Models — Teams re-model the same business entities in different systems, leading to conflicting definitions that are hard to reconcile.
Inconsistent Terminology — Even within a single system, teams may use different terms for the same concept, or the same term for different concepts, making collaboration harder.
Data Quality Issues — Discrepancies and broken references are hard to detect across our many microservices. While identifiers and foreign keys exist, they are inconsistently modeled and poorly documented, requiring manual work from domain experts to find and fix any data issues.
Limited Connectivity — Within systems, relationships between data are constrained by what each system supports. Across systems, they are effectively non-existent.

To address these challenges, we need new foundations that allow us to define a model once, at the conceptual level, and reuse those definitions everywhere. But it isn’t enough to just document concepts; we need to connect them to real systems and data. And more than just connect, we have to project those definitions outward, generating schemas and enforcing consistency across systems. The conceptual model must become part of the control plane.

These were the core ideas that led us to build UDA.

Introducing UDA

UDA (Unified Data Architecture) is the foundation for connected data in Content Engineering. It enables teams to model domains once and represent them consistently across systems — powering automation, discoverability, and semantic interoperability.

Using UDA, users and systems can:

Register and connect domain models — formal conceptualizations of federated business domains expressed as data.

Why? So everyone uses the same official definitions for business concepts, which avoids confusion and stops different teams from rebuilding similar models in conflicting ways.

Catalog and map domain models to data containers, such as GraphQL type resolvers served by a Domain Graph Service, Data Mesh sources, or Iceberg tables, through their representation as a graph.

Why? To make it easy to find where the actual data for these business concepts lives (e.g., in which specific database, table, or service) and understand how it’s structured there.

Transpile domain models into schema definition languages like GraphQL, Avro, SQL, RDF, and Java, while preserving semantics.

Why? To automatically create consistent technical data structures (schemas) for various systems directly from the domain models, saving developers manual effort and reducing errors caused by out-of-sync definitions.

Move data faithfully between data containers, such as from federated GraphQL entities to Data Mesh (a general purpose data movement and processing platform for moving data between Netflix systems at scale), Change Data Capture (CDC) sources to joinable Iceberg Data Products.

Why? To save developer time by automatically handling how data is moved and correctly transformed between different systems. This means less manual work to configure data movement, ensuring data shows up consistently and accurately wherever it’s needed.

Discover and explore domain concepts via search and graph traversal.

Why? So anyone can more easily find the specific business information they’re looking for, understand how different concepts and data are related, and be confident they are accessing the correct information.

Programmatically introspect the knowledge graph using Java, GraphQL, or SPARQL.

Why? So developers can build smarter applications that leverage this connected business information, automate more complex data-dependent workflows, and help uncover new insights from the relationships in the data.

This post introduces the foundations of UDA as a knowledge graph, connecting domain models to data containers through mappings, and grounded in an in-house metamodel, or model of models, called Upper. Upper defines the language for domain modeling in UDA and enables projections that automatically generate schemas and pipelines across systems.

Image of the UDA knowledge graph. A central node representing a domain model is connected to other nodes representing Data Mesh, GraphQL, and Iceberg data containers. — The same domain model can be connected to semantically equivalent data containers in the UDA knowledge graph.

This post also highlights two systems that leverage UDA in production:

Primary Data Management (PDM) is our platform for managing authoritative reference data and taxonomies. PDM turns domain models into flat or hierarchical taxonomies that drive a generated UI for business users. These taxonomy models are projected into Avro and GraphQL schemas, automatically provisioning data products in the Warehouse and GraphQL APIs in the Enterprise Gateway.

Sphere is our self-service operational reporting tool for business users. Sphere uses UDA to catalog and relate business concepts across systems, enabling discovery through familiar terms like ‘actor’ or ‘movie.’ Once concepts are selected, Sphere walks the knowledge graph and generates SQL queries to retrieve data from the warehouse, no manual joins or technical mediation required.

UDA is a Knowledge Graph

UDA needs to solve the data integration problem. We needed a data catalog unified with a schema registry, but with a hard requirement for semantic integration. Connecting business concepts to schemas and data containers in a graph-like structure, grounded in strong semantic foundations, naturally led us to consider a knowledge graph approach.

We chose RDF and SHACL as the foundation for UDA’s knowledge graph. But operationalizing them at enterprise scale surfaced several challenges:

RDF lacked a usable information model. While RDF offers a flexible graph structure, it provides little guidance on how to organize data into named graphs, manage ontology ownership, or define governance boundaries. Standard follow-your-nose mechanisms like owl:imports apply only to ontologies and don’t extend to named graphs; we needed a generalized mechanism to express and resolve dependencies between them.
SHACL is not a modeling language for enterprise data. Designed to validate native RDF, SHACL assumes globally unique URIs and a single data graph. But enterprise data is structured around local schemas and typed keys, as in GraphQL, Avro, or SQL. SHACL could not express these patterns, making it difficult to model and validate real-world data across heterogeneous systems.
Teams lacked shared authoring practices. Without strong guidelines, teams modeled their ontologies inconsistently breaking semantic interoperability. Even subtle differences in style, structure, or naming led to divergent interpretations and made transpilation harder to define consistently across schemas.
Ontology tooling lacked support for collaborative modeling. Unlike GraphQL Federation, ontology frameworks had no built-in support for modular contributions, team ownership, or safe federation. Most engineers found the tools and concepts unfamiliar, and available authoring environments lacked the structure needed for coordinated contributions.

To address these challenges, UDA adopts a named-graph-first information model. Each named graph conforms to a governing model, itself a named graph in the knowledge graph. This systematic approach ensures resolution, modularity, and enables governance across the entire graph. While a full description of UDA’s information infrastructure is beyond the scope of this post, the next sections explain how UDA bootstraps the knowledge graph with its metamodel and uses it to model data container representations and mappings.

Upper is Domain Modeling

Upper is a language for formally describing domains — business or system — and their concepts. These concepts are organized into domain models: controlled vocabularies that define classes of keyed entities, their attributes, and their relationships to other entities, which may be keyed or nested, within the same domain or across domains. Keyed concepts within a domain model can be organized in taxonomies of types, which can be as complex as the business or the data system needs them to be. Keyed concepts can also be extended from other domain models — that is, new attributes and relationships can be contributed monotonically. Finally, Upper ships with a rich set of datatypes for attribute values, which can also be customized per domain.

Visualization of the UDA graph representation of a One Piece character. The Character node in the graph is connected to a Devil Fruit node. The Devil Fruit node is connected to a Devil Fruit Type node. — *The graph representation of the onepiece: domain model from our UI. Depicted here you can see how Characters are related to Devil Fruit, and that each Devil Fruit has a type.*

Upper domain models are data. They are expressed as conceptual RDF and organized into named graphs, making them introspectable, queryable, and versionable within the UDA knowledge graph. This graph unifies not just the domain models themselves, but also the schemas they transpile to — GraphQL, Avro, Iceberg, Java — and the mappings that connect domain concepts to concrete data containers, such as GraphQL type resolvers served by a Domain Graph Service, Data Mesh sources, or Iceberg tables, through their representations. Upper raises the level of abstraction above traditional ontology languages: it defines a strict subset of semantic technologies from the W3C tailored and generalized for domain modeling. It builds on ontology frameworks like RDFS, OWL, and SHACL so domain authors can model effectively without even needing to learn what an ontology is.

Screenshot of UDA UI showing domain model for One Piece serialized as Turtle. — UDA domain model for One Piece. Link to full definition.

Upper is the metamodel for Connected Data in UDA — the model for all models. It is designed as a bootstrapping upper ontology, which means that Upper is self-referencing, because it models itself as a domain model; self-describing, because it defines the very concept of a domain model; and self-validating, because it conforms to its own model. This approach enables UDA to bootstrap its own infrastructure: Upper itself is projected into a generated Jena-based Java API and GraphQL schema used in GraphQL service federated into Netflix’s Enterprise GraphQL gateway. These same generated APIs are then used by the projections and the UI. Because all domain models are conservative extensions of Upper, other system domain models — including those for GraphQL, Avro, Data Mesh, and Mappings — integrate seamlessly into the same runtime, enabling consistent data semantics and interoperability across schemas.

Screenshot of an IDE. It shows Java code using the generated API from the Upper metamodel to traverse and print terms from a domain domain in the top while the bottom contains the output of an execution. — Traversing a domain model programmatically using the Java API generated from the Upper metamodel.

Data Container Representations

Data containers are repositories of information. They contain instance data that conform to their own schema languages or type systems: federated entities from GraphQL services, Avro records from Data Mesh sources, rows from Iceberg tables, or objects from Java APIs. Each container operates within the context of a system that imposes its own structural and operational constraints.

Screenshot of a UI showing details for a Data Mesh Source containing One Piece Characters. — A Data Mesh source is a data container.

Data container representations are data. They are faithful interpretations of the members of data systems as graph data. UDA captures the definition of these systems as their own domain models, the system domains. These models encode both the information architecture of the systems and the schemas of the data containers within. They provide a blueprint for translating the systems into graph representations.

Screenshot of an IDE showing two files open side by side. On the left is a system domain model for Data Mesh. On the right is a representation of a Data Mesh source containing One Piece Character data. — *Side by side/super imposed image of data container schema and representation.* *Link to full data container representation*.

UDA catalogs the data container representations into the knowledge graph. It records the coordinates and metadata of the underlying data assets, but unlike a traditional catalog, it only tracks assets that are semantically connected to domain models. This enables users and systems to connect concepts from domain models to the concrete locations where corresponding instance data can be accessed. Those connections are called Mappings.

Mappings

Mappings are data that connect domain models to data containers. Every element in a domain model is addressable, from the domain model itself down to specific attributes and relationships. Likewise, data container representations make all components addressable, from an Iceberg table to an individual column, or from a GraphQL type to a specific field. A Mapping connects nodes in a subgraph of the domain model to nodes in a subgraph of a container representation. Visually, the Mapping is the set of arcs that link those two graphs together.

Screenshot of UDA UI showing a mapping between a concept in UDA and a Data Mesh Source. — *A mapping between a domain model and a Data Mesh Source from the UDA UI.* *Link to full mapping*.

Mappings enable discovery. Starting from a domain concept, users and systems can walk the knowledge graph to find where that concept is materialized — in which data system, in which container, and even how a specific attribute or relationship is physically accessed. The inverse is also supported: given a data container, one can trace back to the domain concepts it participates in.

Mappings shape UDA’s approach to semantic data integration. Most existing schema languages are not expressive enough in capturing richer semantics of a domain to address requirements for data integration (for example, “accessibility of data, providing semantic context to support its interpretation, and establishing meaningful links between data”). A trivial example of this could be seen in the lack of built-in facilities in Avro to represent foreign keys, making it very hard to express how entities relate across Data Mesh sources. Mappings, together with the corresponding system domain models, allow for such relationships, and many other constraints, to be defined in the domain models and used programmatically in actual data systems.

Mappings enable intent-based automation. Data is not always available in the systems where consumers need it. Because Mappings encode both meaning and location, UDA can reason about how data should move, preserving semantics, without requiring the consumer to specify how it should be done. Beyond the cataloging use case, connecting to existing containers, UDA automatically derives canonical Mappings from registered domain models as part of the projection process.

Projections

A projection produces a concrete data container. These containers, such as a GraphQL schema or a Data Mesh source, implement the characteristics derived from a registered domain model. Each projection is a concrete realization of Upper’s denotational semantics, ensuring semantic interoperability across all containers projected from the same domain model.

Projections produce consistent public contracts across systems. The data containers generated by projections encode data contracts in the form of schemas, derived by transpiling a domain model into the target container’s schema language. UDA currently supports transpilation to GraphQL and Avro schemas.

The GraphQL transpilation produces a schema that adheres to the official GraphQL spec with the ability to generate all GraphQL types defined in the spec. Given that the UDA domain model can be federated, it also supports generating federated graphQL schemas. Below is an example of a transpiled GraphQL schema.

Screenshot of an IDE showing two files open side by side. On the left is the definition of a Character in UDA. On the right is transpiled GraphQL schema. — *Domain model on the left, with transpiled GraphQL schema on the right.* *Link to full transpiled GraphQL schema*.

The Avro transpilation produces a schema that is a Data Mesh flavor of Avro, which includes some customization on top of the official Avro spec. This schema is used to automatically create a Data Mesh source container. Below is an example of a transpiled Avro schema.

Screenshot of an IDE showing two files open side by side. On the left is the definition of a Devil Fruit in UDA. On the right is transpiled Avro schema. — *Domain model on the left, with transpiled Avro schema on the right.* *Link to full transpiled Avro schema*.

Projections can automatically populate data containers. Some projections, such as those to GraphQL schemas or Data Mesh sources produce empty containers that require developers to populate the data. This might be creating GraphQL APIs or pushing events onto Data Mesh sources. Conversely, other containers, like Iceberg Tables, are automatically created and populated by UDA. For Iceberg Tables, UDA leverages the Data Mesh platform to automatically create data streams to move data into tables. This process utilizes much of the same infrastructure detailed in this blog post here.

Projections have mappings. UDA automatically generates and manages mappings between the newly created data containers and the projected domain model.

Early Adopters

Controlled Vocabularies (PDM)

The full range of Netflix’s business activities relies on a sprawling data model that captures the details of our many business processes. Teams need to be able to coordinate operational activities to ensure that content production is complete, advertising campaigns are in place, and promotional assets are ready to deploy. We implicitly depend upon a singular definition of shared concepts, such as content production is complete. Multiple definitions create coordination challenges. Software (and humans) don’t know that the definitions mean the same thing.

We started the Primary Data Management (PDM) initiative to create unified and consistent definitions for the core concepts in our data model. These definitions form controlled vocabularies, standardized and governed lists for what values are permitted within certain fields in our data model.

Primary Data Management (PDM) is a single place where business users can manage controlled vocabularies. Our data model governance has been scattered across different tools and teams creating coordination challenges. This is an information management problem relating to the definition, maintenance and consistent use of reference data and taxonomies. This problem is not unique to Netflix, so we looked outward for existing solutions to this problem.

Screenshot of PDM UI — Managing the taxonomy of One Piece characters in PDM.

PDM uses the Simple Knowledge Organization System (SKOS) model. It is a W3C data standard designed for modeling knowledge. Its terminology is abstract, with Concepts that can be organized into ConceptSchemes and properties to describe various types of relationships. Every system is hardcoded against something, that’s how software knows how to manipulate data. We want a system that can work with a data model as its input, so we still need something concrete to build the software against. This is what SKOS provides, a generic basis for modeling knowledge that our system can understand.

PDM uses Domain Models to integrate SKOS into the rest of Content Engineering’s ecosystem. A core premise of the system is that it takes a domain model as input, and everything that can be derived is derived from that model. PDM builds a user interface based upon the model definition and leverages UDA to project this model into type-safe interfaces for other systems to use. The system will provision a Domain Graph Service (DGS) within our federated GraphQL API environment using a GraphQL schema that UDA projects from the domain model. UDA is also used to provision data movement pipelines which are able to feed our GraphSearch infrastructure as well as move data into the warehouse. The data movement systems use Avro schemas, and UDA creates a projection from the domain model to Avro.

Consumers of controlled vocabularies never know they’re using SKOS. Domain models use terms that fit in with the domain. SKOS’s generic notion of broader and narrower to define a hierarchy are hidden from consumers as super-properties within the model. This allows consumers to work with language that is familiar to them while enabling PDM to work with any model. The best of both worlds.

Operational Reporting (Sphere)

Operational reporting serves the detailed day-to-day activities and processes of a business domain. It is a reporting paradigm specialized in covering high-resolution, low-latency data sets.

Operational reporting systems should generate reports without relying on technical intermediaries. Operational reporting systems need to address the persistent challenge of empowering business users to explore and obtain the data they need, when they need it. Without such self-service systems, requests for new reports or data extracts often result in back-and-forth exchanges, where the initial query may not exactly meet business users’ expectations, requiring further clarification and refinement.

Data discovery and query generation are two relevant aspects of data integration. Supplying end-users with an accurate, contextual, and user-friendly data discovery experience provides a basis for query generation mechanism which produces syntactically correct and semantically reliable queries.

Operational reports are predominantly run on data hydrated from GraphQL services into the Data Warehouse. You can read about our journey from conventional data movement to streaming data pipelines based on CDC and GraphQL hydration in this blog post. Among the challenging byproducts of this approach was that a single, distinct data concept is now present in two places (GraphQL and data warehouse), with some disparity in semantic context to guide and support the interpretations and connectivity of that data. To address this, we formulate a mechanism to use the syntax and semantics captured in the federated schema from Netflix’s Enterprise GraphQL and populate representational domain models in UDA to preserve those details and add more.

Domain models enable the data discovery experience. Metadata aggregated from various data-producing systems is captured in UDA domain models using a unified vocabulary. This metadata is surfaced for the users’ search and discovery needs; instead of specifying exact tables and join keys, users simply can search for familiar business concepts such as ‘actors’ or ‘movies’. We use UDA models to disambiguate and resolve the intended concepts and their related data entities.

UDA knowledge graph is the data landscape for query generation. Once concepts are discovered and their mappings to corresponding data containers are identified and located in the knowledge graph, we use them to establish join strategies. Through graph traversal, we identify boundaries and islands within the data landscape. This ensures only feasible, joinable combinations are selected while weeding out semantically incorrect and non-executable query candidates.

Screenshot of Sphere’s UI — Generating a report in Sphere.

Sphere is a UDA-powered self-service operational reporting system. The solution based on knowledge graphs described above is called Sphere. Seeing self-service operational reporting through this lens, we can improve business users’ agency in access to operational data. They are empowered to explore, assemble, and refine reports at the conceptual level, while technical complexities are managed by the system.

Stay Tuned

UDA marks a fundamental shift in how we approach data modeling within Content Engineering. By providing a unified knowledge graph composed of what we know about our various data systems and the business concepts within them, we’ve made information more consistent, connected, and discoverable across our organization. We’re excited about future applications of these ideas such as:

Supporting additional projections like Protobuf/gRPC
Materializing the knowledge graph of instance data for querying, profiling, and management
Finally solving some of the initial challenges posed by Graph Search (that actually inspired some of this work)

If you’re interested in this space, we’d love to connect — whether you’re exploring new roles down the road or just want to swap ideas.

Expect to see future blog posts exploring PDM and Sphere in more detail soon!

Credits

Thanks to Andreas Legenbauer, Bernardo Gomez Palacio Valdes, Charles Zhao, Christopher Chong, Deepa Krishnan, George Pesmazoglou, Jessica Silva, Katherine Anderson, Malik Day, Rita Bogdanova, Ruoyun Zheng, Shawn Stedman, Suchita Goyal, Utkarsh Shrivastava, Yoomi Koh, Yulia Shmeleva

Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

AV1 @ Scale: Film Grain Synthesis, The Awakening

July 2, 2025, 7:21 am

≫ Next: Driving Content Delivery Efficiency Through Classifying Cache Misses

≪ Previous: Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix

Unleashing Film Grain Synthesis on Netflix and Enhancing Visuals for Millions

Li-Heng Chen, Andrey Norkin, Liwei Guo, Zhi Li, Agata Opalach and Anush Moorthy

Picture this: you’re watching a classic film, and the subtle dance of film grain adds a layer of authenticity and nostalgia to every scene. This grain, formed from tiny particles during the film’s development, is more than just a visual effect. It plays a key role in storytelling by enhancing the film’s depth and contributing to its realism. However, film grain is as elusive as it is beautiful. Its random nature makes it notoriously difficult to compress. Traditional compression algorithms struggle to manage it, often forcing a choice between preserving the grain and reducing file size.

In the digital age, noise remains a ubiquitous element in video content. Camera sensor noise introduces its own characteristics, while filmmakers often add intentional grain during post-production to evoke mood or a vintage feel. These elements create a visually rich experience that tests conventional compression methods.

We’re giving members globally a transformed streaming experience with the recent rollout of AV1 Film Grain Synthesis (FGS) streams. While FGS has been part of the AV1 standard since its inception, we only enabled it for a limited number of titles during our initial launch of the AV1 codec in 2021. Now, we’re enabling this innovative technology at scale, leveraging it to preserve the artistic integrity of film grain while optimizing data efficiency. In this blog post, we’ll explore how FGS revolutionizes video streaming and enhances your viewing experience.

Understanding Film Grain Synthesis in AV1

The AV1 Film Grain Synthesis tool models film grain through two key components, with model parameters estimated before the encoding of the denoised video:

Film Grain Pattern: an auto-regressive (AR) model is used to replicate the pattern of film grain. The key parameters are the AR coefficients, which can be estimated from the residual between the source video and the denoised video, essentially capturing the noise. This model captures the spatial correlation between the grain samples, ensuring that the noise characteristics of the original content are accurately preserved. By adjusting the auto-regressive coefficients {ai}, the model can control the grain’s shape, making it appear coarser or finer. With these coefficients, a 64x64 noise template is generated, as illustrated in the animation below. To construct the noise layer during playback, random 32x32 patches are extracted from the 64x64 noise template and added to the decoded video.

Fig. 1 The synthesis process of the 64x64 noise template using the simplest AR kernel with a lag parameter L=1. Each noise value is calculated as a linear combination of previously synthesized noise sample values, with AR coefficients a0, a1, a2, a3 and a white Gaussian noise (wgn) component.

Film Grain Intensity: a scaling function is employed to control the grain’s appearance under varying lighting conditions. This function, estimated during the encoding process, models the relationship between pixel value and noise intensity using a piecewise linear function. This allows for precise adjustments to the grain strength based on video brightness and color. Consequently, the film grain strength is adapted to the areas of the picture, closely recreating the look of the original video. The animation below demonstrates how the grain intensity is adjusted by the scaling function:

Fig. 2 Illustration of the scaling function’s impact on film grain intensity. Left: The scaling function graph showing the relationship between pixel value and scaling intensity. Right: A grayscale SMPTE bars frame with film grain applied according to the scaling function.

With these models specified by AV1 standard, the encoding process first removes the film grain from the video. The standard does not mandate a specific method for this step, allowing users to choose their preferred denoiser. Following the denoising, the video is compressed, and the grain’s pattern and intensity are estimated and transmitted alongside the compressed video data. During playback, the film grain is recreated and reintegrated into the video using a block-based method. This approach is optimized for consumer devices, ensuring smooth playback and high-quality visuals. For a more detailed explanation, please refer to the original paper.

By combining these components, the AV1 Film Grain Synthesis tool preserves the artistic integrity of film grain while making the content “easier to compress” by denoising the source video prior to encoding. This process enables high-quality video streaming, even in content with heavy grain, resulting in significant bitrate savings and improved visual quality.

Visual Quality Improvement, Bitrate Reduction, and Member Benefits

In our pursuit of premium streaming quality, enabling AV1 Film Grain Synthesis has led to significant bitrate reduction, allowing us to deliver high-quality video with less data while preserving the artistic integrity of film grain. Below, we showcase visual examples highlighting the improved quality and reduced bitrate, using a frame from the Netflix title They Cloned Tyrone:

A source video frame from *They Cloned Tyrone*

The visual comparison highlights a significant bitrate reduction of approximately 66%, with regular AV1 encoding at 8274 kbps compared to AV1 with FGS at 2804 kbps. In this example, which features strong film grain, it may be observed that the regular version exhibits distorted noise with a discrete cosine transform (DCT)-like pattern. In contrast, the FGS version preserves the integrity of the film grain at a lower bitrate.

Additionally, synthesized noise effectively masks compression artifacts, resulting in a more visually appealing experience. In this comparison below, both the regular AV1 stream and the AV1 FGS stream without synthesized noise (equivalent to compressing the denoised video) show compression artifacts. In contrast, the AV1 FGS stream with grain synthesis (rightmost figure) improves visual quality through contrast masking in human visual systems. The added film grain, a form of mask, effectively conceals some compression artifacts.

Cropped frame comparison: Regular AV1 stream (Left), AV1 FGS stream **without** grain synthesis during decoding (Middle), and AV1 FGS stream with grain synthesis (Right).

Currently, we lack a dedicated quality model for film grain synthesis. The noise appearing at different pixel locations between the source and decoded video poses challenges for pixelwise comparison methods like PSNR or VMAF, leading to penalized quality scores. Despite this, our internal assessment highlights the improvements in visual quality and the value of these advancements.

To evaluate the impact of AV1 Film Grain Synthesis, we selected approximately 300 titles from the Netflix catalog, each with varying levels of graininess. The bar chart below illustrates a 36% reduction in average bitrate for resolutions of 1080p and above when AV1 film grain synthesis is enabled, highlighting its efficacy in optimizing data usage. For resolutions below 1080p, the reduction in bitrate is relatively small, reaching only a 10% decrease, likely because noise is filtered out during the downscaling process. Furthermore, enabling the film grain synthesis coding tool consistently introduces syntax overhead to the bitstream.

Fig. 3: Comparison of average values across resolution categories between regular AV1 streams (without film grain synthesis) and AV1 streams with film grain synthesis enabled.

Finally, we conducted A/B testing prior to rollout to understand the overall streaming impact of enabling AV1 Film Grain Synthesis. This testing showcased a smoother and more reliable Quality of Experience (QoE) for our members. The improvements include:

Lower Initial and Average Bitrate: Bitrate at the start of the playback reduced by 24% and average bitrate by 31.6%, lower network bandwidth requirements and reduced storage needs for downloaded streams.
Decreased Playback Errors: Playback error rate reduced by approximately 3%.
Reduced Rebuffering: 10% fewer rebuffers and a 5% reduction in rebuffer duration resulting from the lower bitrate.
Faster Start Play: Start play delay reduced by 10%, potentially due to the lower bitrate, which may help devices reach the target buffer level more quickly.
Improved Playback Stability: Observed 10% fewer noticeable bitrate drops and a 10% reduction in the time users spend adjusting their playback position during video playback, likely influenced by reduced bitrate and rebuffering.
Higher Resolution Streaming: About 0.7% of viewing hours shifted from lower resolutions (≤ 1080p) to 2160p on 4K-capable devices. This shift is attributed to reduced bitrates at switching points, which make it easier to achieve the highest resolution during a session.

Behind the Scenes: Our Film Grain Adventure Continues

We’re always excited to share our progress with the community. This blog provides an overview of our journey: from the initial launch of the AV1 codec to the recent addition of Film Grain Synthesis (FGS) streams, highlighting the impact these innovations have on Netflix’s streaming quality. Since March, we’ve been rolling out FGS across scale, and many users can now enjoy the FGS-enabled streams, provided their device supports this feature. We encourage you to watch some of the author’s favorite titles The Hot Spot, Kung Fu Cult Master, Initial D, God of Gamblers II, Baahubali 2: The Conclusion, or Dept. Q (you may need to toggle off HDR from the settings menu) on Netflix to experience the new FGS streams firsthand.

In the next post, we will share how we did this in our video encoding pipeline, detailing the process and insights we’ve gained. Stay tuned to the Netflix Tech Blog for the latest updates.

Acknowledgments

This achievement is the result of a collaborative effort among several Open Connect teams at Netflix, including Video Algorithms, Media Encoding Pipeline, Media Foundations, Infrastructure Capacity Planning, and Open Connect Control Plane. We also received invaluable support from Client & Partner Technologies, Streaming & Discovery Experiences, Media Compute & Storage Infrastructure, Data Science & Engineering, and the Global Production Technology team. We would like to express our sincere gratitude to the following individuals for their contributions to the project’s success:

Prudhvi Kumar Chaganti and Ken Thomas for the discussion and assistance on rollout strategy
Poojarani Chennai Natarajan, Lara Deek , Ivan Ivanov, and Ishaan Shastri for their essential support in planning and operations for Open Connect.
Alex Chang for his support in everything related to data analysis, and Jessica Tweneboah and Amelia Taylor for their assistance with AB testing.
David Zheng, Janet Xue, Scott Bolter, Brian Li, Allan Zhou, Vivian Li, Sarah Kurdoghlian, Artem Danylenko, Greg Freedman, and many other dedicated team members played a crucial role in device certification and collaboration with device partners. Their efforts significantly improved compatibility across platforms. (Spoiler alert: this was one of the biggest challenges we faced for productizing AV1 FGS!)
Javier Fernandez-Ivern and Ritesh Makharia expertly managed the playback logic
Joseph McCormick and JD Vandenberg for providing valuable insights from a content production point of view, and Alex ‘Ally’ Michaelson for assisting in monitoring customer service.
A special thanks to Roger Quero, who played a key role in supporting various aspects of the project and contributed significantly to its overall success while he was at Netflix.

AV1 @ Scale: Film Grain Synthesis, The Awakening was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Driving Content Delivery Efficiency Through Classifying Cache Misses

July 2, 2025, 8:20 am

≫ Next: Netflix Tudum Architecture: from CQRS with Kafka to CQRS with RAW Hollow

≪ Previous: AV1 @ Scale: Film Grain Synthesis, The Awakening

By Vipul Marlecha, Lara Deek, Thiara Ortiz

The mission of Open Connect, our dedicated content delivery network (CDN), is to deliver the best quality of experience (QoE) to our members. By localizing our Open Connect Appliances (OCAs), we bring Netflix content closer to the end user. This is achieved through close partnerships with internet service providers (ISPs) worldwide. Our ability to efficiently localize traffic, known as Content Delivery Efficiency, is a critical component of Open Connect’s service.

In this post, we discuss one of the frameworks we use to evaluate our efficiency and identify sources of inefficiencies. Specifically, we classify the causes of traffic not being served from local servers, a phenomenon that we refer to as cache misses.

Why does Netflix have the Open Connect Program?

The Open Connect Program is a cornerstone of Netflix’s commitment to delivering unparalleled QoE for our customers. By localizing traffic delivery from Open Connect servers at IX or ISP sites, we significantly enhance the speed and reliability of content delivery. The inherent latencies of data traveling across physical links, compounded by Internet infrastructure components like routers and network stacks, can disrupt a seamless viewing experience. Delays in video start times, reduced initial video quality, and the frustrating occurrence of buffering lead to an overall reduction in customer QoE. Open Connect empowers Netflix to maintain hyper-efficiency, ensuring a flawless client experience for new, latency-sensitive, on-demand content such as live streams and ads.

Our custom-built servers, known as Open Connect Appliances (OCAs), are designed for both efficiency and cost-effectiveness. By logging detailed historical streaming behavior and using it to model and forecast future trends, we hyper-optimize our OCAs for long-term caching efficiency. We build methods to efficiently and reliably store, stream, and move our content.

The mission of Open Connect hinges on our ability to effectively localize content on our OCAs globally, despite limited storage space, and also by design with specific storage sizes. This ensures that our cost and power efficiency metrics continue to improve, enhancing client QoE and reducing costs for our ISP partners. A critical question we continuously ask is: How do we evaluate and monitor which bytes should have been served from local OCAs but resulted in a cache miss?

The Anatomy of a Playback Request

Let us start by introducing the logic that directs or “steers” a specific Netflix client device to its dedicated OCA. The lifecycle from when a client device presses play until the video starts being streamed to that device is referred to as “playback.” Figure 1 illustrates the logical components involved in playback.

Figure 1: Components for Playback

The components involved in playback are important to understand as we elaborate on the concept of how we determine a cache miss versus hit. Independent of client requests, every OCA in our CDN periodically reports its capacity and health, learned BGP routes, and current list of stored files. All of this data is reported to the Cache Control Service (CCS). When a member hits the play button, this request is sent to our AWS services, specifically the Playback Apps service. After Playback Apps determines which files correspond to a specific movie request, it issues a request to “steer” the client’s playback request to OCAs via the Steering Service. The Steering Service in turn, using the data reported from OCAs to CCS as well as other client information such as geo location, identifies the set of OCAs that can satisfy that client’s request. This set of OCAs is then returned in the form of rank-ordered URLs to the client device, the client connects to the top-ranked OCA and requests the files it needs to begin the video stream.

What is a Cache Miss?

A cache miss occurs when bytes are not served from the best available OCA for a given Netflix client, independent of OCA state. For each playback request, the Steering Service computes a ranked list of local sites for the client, ordered by network proximity alone. This ranked list of sites is known as the “proximity rank.” Network proximity is determined based on the IP ranges (BGP routes) that are advertised by our ISP partners. Any OCA from the first “most proximal” site on this list is the most preferred and closest, having advertised the longest, most specific matching prefix to the client’s IP address. A cache miss is logged when bytes are not streamed from any OCA at this first local site, and we log when and why that happens.

It is important to note that our concept of cache misses is viewed from the client’s perspective, focusing on the optimal delivery source for the end user and prepositioning content accordingly, rather than relying on traditional CDN proxy caching mechanisms. Our “prepositioning” differentiator allows us to prioritize client QoE by ensuring content is served from the most optimal OCA.

We attribute cache misses to three logical categories. The intuition behind the delineated categories is that each category informs parallel strategies to achieve content delivery efficiency.

Content Miss: This happens when the files were not found on OCAs in the local site. In previous articles like “Content Popularity for Open Connect” and “Distributing Content to Open Connect,” we discuss how we decide what content to prioritize populating first onto our OCAs. A sample of efforts this insights informs include: (1) how accurately we predict the popularity of content, (2) how rapidly we pre-position that content, (3) how well we design our OCA hardware, and (4) how well we provision storage capacity at our locations of presence.
Health Miss: This happens when the local site’s OCA hardware resources are becoming saturated, and one or more OCA can not handle more traffic. As a result, we direct clients to other OCAs with capacity to serve that content. Each OCA has a control loop that monitors its bottleneck metrics (such as CPU, disk usage, etc.) and assesses its ability to serve additional traffic. This is referred to as “OCA health.” Insight into health misses informs efforts such as: (1) how well we load balance traffic across OCAs with heterogeneous hardware resources, (2) how well we provision enough copies of highly popular content to distribute massive traffic, which is also tied to how accurately we predict the popularity of content, and (3) how well we preposition content to specific hardware components with varying traffic serve capabilities and bottlenecks.

Next we will dig into the framework we built to log and compute these metrics in real-time, with some extra attention to technical detail.

Cache Miss Computation Framework

Logging Components

There are two critical data components that we log, gather, and analyze to compute cache misses:

Steering Playback Manifest Logs: Within the Steering Service, we compute and log the ranked list of sites for each client request, i.e. the “proximity rank” introduced earlier. We also enrich that list with information that reflects the logical decisions and filters our algorithms applied across all proximity ranks given that point-in-time state of our systems. This information allows us to replay/simulate any hypothetical scenario easily, such as to evaluate whether an outage across all sites in the first proximity rank would overwhelm sites in the second proximity rank, and many more such scenarios!
OCA Server Logs: Once a Netflix client connects with an OCA to begin video streaming, the OCAs log any data regarding that streaming session, such as the files streamed and total bytes. All OCA logs are consolidated to identify which OCA(s) each client actually watched its video stream from, and the amount of content streamed.

The above logs are joined for every Netflix client’s playback request to compute detailed cache miss metrics (in bytes and hours streamed) at different aggregation levels (such as per OCA, movie, file, encode type, country, and so on).

System Architecture

Figure 2 outlines how the logging components fit into the general engineering architecture that allows us to compute content miss metrics at low-latency and almost real-time.

Figure 2: Components of the cache miss computation framework.

We will now describe the system requirements of each component.

Log Emission: The logs for computing cache miss are emitted to Kafka clusters in each of our evaluated AWS regions, enabling us to send logs with the lowest possible latency. After a client device makes a playback request, the Steering Service generates a steering playback manifest, logs it, and sends the data to a Kafka cluster. Kafka is used for event streaming at Netflix because of its high-throughput event processing, low latency, and reliability. After the client device starts the video stream from an OCA, the OCA stores information about the bytes served for each file requested by each unique client playback stream. This data is what we refer to as OCA server logs.
Log Consolidation: The logs emitted by the Steering Service and the OCAs can result in data for a single playback request being distributed across different AWS regions, because logs are recorded in geographically distributed Kafka clusters. OCA server logs might be stored in one region’s Kafka cluster while steering playback manifest logs are stored in another. One approach to consolidate data for a single playback is to build complex many-to-many joins. In streaming pipelines, performing these joins requires replicating logs across all regions, which leads to data duplication and increased complexity. This setup complicates downstream data processing and inflates operational costs due to multiple redundant cross-region data transfers. To overcome these challenges, we perform a cross-region transfer only once, consolidating all logs into a single region.
Log Enrichment: We enrich the logs during streaming joins with metadata using various slow-changing dimension tables and services so that we have the necessary information about the OCA and the played content.
Streaming Window-Based Join: We perform a streaming window-based join to merge the steering playback manifest logs with the OCA server logs. Performing enrichment and log consolidation upstream allows for more seamless and un-interrupted joining of our log data sources.
Cache Miss Calculations: After joining the logs, we compute the cache miss metrics. The computation checks whether the client played content from an OCA in the first site listed in the steering playback manifest’s proximity rank or from another site. When a video stream occurs at a higher proximity rank, this indicates that a cache miss occurred.

Data Model to Evaluate Cache Misses

One of the most exciting opportunities we have enabled through these logs (in these authors’ opinions) is the ability to replay our logic offline and in simulations with variable parameters, to reproduce impact in production under different conditions. This allows us to test new conditions, features, and hypothetical scenarios without impacting production Netflix traffic.

To achieve the above, our data should satisfy two main conditions. First, the data should be comprehensive in representing the state of each distinct logical step involved in steering, including the decisions and their reasons. In order to achieve this, the underlying logic, here the Steering Service, needs to be built in a modularized fashion, where each logical component overlays data from the prior component, resulting in a rich blurb representing the system’s full state, which is finally logged. This all needs to be achieved without adding perceivable latency to client playback requests! Second, the data should be in a format that allows near-real-time aggregate metrics for monitoring purposes.

Some components of our final, joined data model that enables us to collect rich insights in a scalable and timely manner are listed in Table 1.

Table 1: Unified Data Model after joining steering playback manifest and OCA server logs.

Cache Miss Computation Sample

Let us share an example of how we compute cache miss metrics. For a given unique client play request, we know we had a cache miss when the client streams from an OCA that is not in the client’s first proximity rank. As you can see from Table 1, each file needed for a client’s video streaming session is linked to routable OCAs and their corresponding sites with a proximity rank. These are 0 based indexes with proximity rank zero indicating the most optimal OCA for the client. “Proximity Rank Zero” indicates that the client connected to an OCA in the most preferred site(s), thus no misses occurred. Higher proximity ranks indicate a miss has occurred. The aggregation of all bytes and hours streamed from non-preferred sites constitutes a missed opportunity for Netflix and are reported in our cache miss metrics.

Decision Labels and Bytes Sent

Sourced from the steering playback manifest logs, we record why we did not select an OCA for playback. These are denoted by:

“H”: Health miss.
“C”: Content miss.

Metrics Calculation and Categorization

For each file needed for a client’s video streaming session, we can categorize the bytes streamed by the client into different types of misses:

No Miss: If proximity rank is zero, bytes were streamed from the optimal OCA.
Health Miss (“H”): Miss due to the OCA reporting high utilization.
Content Miss (“C”): Miss due to the OCA not having the content available locally.

How are miss metrics used to monitor our efficiency?

Open Connect uses cache miss metrics to manage our Open Connect infrastructure. One of the team’s goals is to reduce the frequency of these cache misses, as they indicate that our members are being served by less proximal OCAs. By maintaining a detailed set of metrics that reveal the reasons behind cache misses, we can set up alerts to quickly identify when members are streaming from suboptimal locations. This is crucial because we operate a global CDN with millions of members worldwide and tens of thousands of servers.

The figure below illustrates how we track the volume of total streaming traffic alongside the proportion of traffic streamed from less preferred locations due to content shedding. By calculating the ratio of content shed traffic to total streamed traffic, we derive a content shed ratio:

content shed ratio = content shed traffic total streamed traffic

This active monitoring of content shedding allows us to maintain a tight feedback loop to ensure the efficacy of our deployment and prediction algorithms, streaming traffic, and the QoE of our members. Given that content shedding can occur for multiple reasons, it is essential to have clear signals indicating when it happens, along with known and automated remediation strategies, such as mechanisms to quickly deploy mispredicted content onto OCAs. When special intervention is necessary to minimize shedding, we use it as an opportunity to enhance our systems as well as to ensure they are comprehensive in considering all known failure cases.

Conclusion

Open Connect’s unique strategy requires us to be incredibly efficient in delivering content from our OCAs. We closely track miss metrics to ensure we are maximizing the traffic our members stream from most proximal locations. This ensures we are delivering the best quality of experience to our members globally.

Our methods for managing cache misses are evolving, especially with the introduction of new streaming types like Live and Ads, which have different streaming behaviors and access patterns compared to traditional video. We remain committed to identifying and seizing opportunities for improvement as we face new challenges.

Driving Content Delivery Efficiency Through Classifying Cache Misses was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Netflix Tudum Architecture: from CQRS with Kafka to CQRS with RAW Hollow

July 10, 2025, 12:31 pm

≪ Previous: Driving Content Delivery Efficiency Through Classifying Cache Misses

By Eugene Yemelyanau, Jake Grice

Introduction

Tudum.com is Netflix’s official fan destination, enabling fans to dive deeper into their favorite Netflix shows and movies. Tudum offers exclusive first-looks, behind-the-scenes content, talent interviews, live events, guides, and interactive experiences. “Tudum” is named after the sonic ID you hear when pressing play on a Netflix show or movie. Attracting over 20 million members each month, Tudum is designed to enrich the viewing experience by offering additional context and insights into the content available on Netflix.

Initial architecture

At the end of 2021, when we envisioned Tudum’s implementation, we considered architectural patterns that would be maintainable, extensible, and well-understood by engineers. With the goal of building a flexible, configuration-driven system, we looked to server-driven UI (SDUI) as an appealing solution. SDUI is a design approach where the server dictates the structure and content of the UI, allowing for dynamic updates and customization without requiring changes to the client application. Client applications like web, mobile, and TV devices, act as rendering engines for SDUI data. After our teams weighed and vetted all the details, the dust settled and we landed on an approach similar to Command Query Responsibility Segregation (CQRS). At Tudum, we have two main use cases that CQRS is perfectly capable of solving:

Tudum’s editorial team brings exclusive interviews, first-look photos, behind the scenes videos, and many more forms of fan-forward content, and compiles it all into pages on the Tudum.com website. This content comes onto Tudum in the form of individually published pages, and content elements within the pages. In support of this, Tudum’s architecture includes a write path to store all of this data, including internal comments, revisions, version history, asset metadata, and scheduling settings.
Tudum visitors consume published pages. In this case, Tudum needs to serve personalized experiences for our beloved fans, and accesses only the latest version of our content.

The high-level diagram above focuses on storage & distribution, illustrating how we leveraged Kafka to separate the write and read databases. The write database would store internal page content and metadata from our CMS. The read database would store read-optimized page content, for example: CDN image URLs rather than internal asset IDs, and movie titles, synopses, and actor names instead of placeholders. This content ingestion pipeline allowed us to regenerate all consumer-facing content on demand, applying new structure and data, such as global navigation or branding changes. The Tudum Ingestion Service converted internal CMS data into a read-optimized format by applying page templates, running validations, performing data transformations, and producing the individual content elements into a Kafka topic. The Data Service Consumer, received the content elements from Kafka, stored them in a high-availability database (Cassandra), and acted as an API layer for the Page Construction service and other internal Tudum services to retrieve content.

A key advantage of decoupling read and write paths is the ability to scale them independently. It is a well-known architectural approach to connect both write and read databases using an event driven architecture. As a result, content edits would eventually appear on tudum.com.

Challenges with eventual consistency

Did you notice the emphasis on “eventually?” A major downside of this architecture was the delay between making an edit and observing that edit reflected on the website. For instance, when the team publishes an update, the following steps must occur:

Call the REST endpoint on the 3rd party CMS to save the data.
Wait for the CMS to notify the Tudum Ingestion layer via a webhook.
Wait for the Tudum Ingestion layer to query all necessary sections via API, validate data and assets, process the page, and produce the modified content to Kafka.
Wait for the Data Service Consumer to consume this message from Kafka and store it in the database.
Finally, after some cache refresh delay, this data would eventually become available to the Page Construction service. Great!

By introducing a highly-scalable eventually-consistent architecture we were missing the ability to quickly render changes after writing them — an important capability for internal previews.

In our performance profiling, we found the source of delay was our Page Data Service which acted as a facade for an underlying Key Value Data Abstraction database. Page Data Service utilized a near cache to accelerate page building and reduce read latencies from the database.

This cache was implemented to optimize the N+1 key lookups necessary for page construction by having a complete data set in memory. When engineers hear “slow reads,” the immediate answer is often “cache,” which is exactly what our team adopted. The KVDAL near cache can refresh in the background on every app node. Regardless of which system modifies the data, the cache is updated with each refresh cycle. If you have 60 keys and a refresh interval of 60 seconds, the near cache will update one key per second. This was problematic for previewing recent modifications, as these changes were only reflected with each cache refresh. As Tudum’s content grew, cache refresh times increased, further extending the delay.

RAW Hollow

As this pain point grew, a new technology was being developed that would act as our silver bullet. RAW Hollow is an innovative in-memory, co-located, compressed object database developed by Netflix, designed to handle small to medium datasets with support for strong read-after-write consistency. It addresses the challenges of achieving consistent performance with low latency and high availability in applications that deal with less frequently changing datasets. Unlike traditional SQL databases or fully in-memory solutions, RAW Hollow offers a unique approach where the entire dataset is distributed across the application cluster and resides in the memory of each application process.

This design leverages compression techniques to scale datasets up to 100 million records per entity, ensuring extremely low latencies and high availability. RAW Hollow provides eventual consistency by default, with the option for strong consistency at the individual request level, allowing users to balance between high availability and data consistency. It simplifies the development of highly available and scalable stateful applications by eliminating the complexities of cache synchronization and external dependencies. This makes RAW Hollow a robust solution for efficiently managing datasets in environments like Netflix’s streaming services, where high performance and reliability are paramount.

Revised architecture

Tudum was a perfect fit to battle-test RAW Hollow while it was pre-GA internally. Hollow’s high-density near cache significantly reduces I/O. Having our primary dataset in memory enables Tudum’s various microservices (page construction, search, personalization) to access data synchronously in O(1) time, simplifying architecture, reducing code complexity, and increasing fault tolerance.

In our simplified architecture, we eliminated the Page Data Service, Key Value store, and Kafka infrastructure, in favor of RAW Hollow. By embedding the in-memory client directly into our read-path services, we avoid per-request I/O and reduce roundtrip time.

Migration results

The updated architecture yielded a monumental reduction in data propagation times, and the reduced I/O led to faster request times as an added bonus. Hollow’s compression alleviated our concerns about our data being “too big” to fit in memory. Storing three years’ of unhydrated data requires only a 130MB memory footprint — 25% of its uncompressed size in an Iceberg table!

Writers and editors can preview changes in seconds instead of minutes, while still maintaining high-availability and in-memory caching for Tudum visitors — the best of both worlds.

But what about the faster request times? The diagram below illustrates the before & after timing to fulfil a request for Tudum’s home page. All of Tudum’s read-path services leverage Hollow in-memory state, leading to a significant increase in page construction speed and personalization algorithms. Controlling for factors like TLS, authentication, request logging, and WAF filtering, homepage construction time decreased from ~1.4 seconds to ~0.4 seconds!

An attentive reader might notice that we have now tightly-coupled our Page Construction Service with the Hollow In-Memory State. This tight-coupling is used only in Tudum-specific applications. However, caution is needed if sharing the Hollow In-Memory Client with other engineering teams, as it could limit your ability to make schema changes or deprecations.

Key Learnings

CQRS is a powerful design paradigm for scale, if you can tolerate some eventual consistency.
Minimizing the number of sequential operations can significantly reduce response times. I/O is often the main enemy of performance.
Caching is complicated. Cache invalidation is a hard problem. By holding an entire dataset in memory, you can eliminate an entire class of problems.

In the next episode, we’ll share how Tudum.com leverages Server Driven UI to rapidly build and deploy new experiences for Netflix fans. Stay tuned!

Credits

Thanks to Drew Koszewnik, Govind Venkatraman Krishnan, Nick Mooney, George Carlucci

Netflix Tudum Architecture: from CQRS with Kafka to CQRS with RAW Hollow was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧