Abstract: This article is a talk shared by Xu Zhenzhong, a senior software engineer at Netflix. It covers interesting cases, the various challenges encountered in the foundations of distributed systems, and their solutions. It also discusses lessons learned during development and operations, a new vision for an open, self-service real-time data platform, and new thinking about a real-time ETL infrastructure platform. The article is divided into three parts:
- Product background
- Product function
- Challenges & Solutions
Netflix is committed to bringing joy to its members. We are constantly focused on improving the product experience and delivering high-quality content. In recent years, we have invested heavily in a technology-driven studio and in content production. Along the way, we found many unique and interesting challenges in the field of real-time data platforms. For example, in a microservice architecture, domain objects are distributed across different applications and their stateful stores, which makes low-latency, highly consistent real-time reporting and entity search and discovery particularly challenging.
Netflix’s long-term vision is to bring joy and smiles to the whole world. By producing high-quality, diverse content around the world and putting it on the platform, Netflix shares it with more than 100 million members. To deliver a pleasant experience, Netflix’s efforts fall into two parts:
- On the one hand, insights from data are fed back to improve the user’s product experience;
- On the other hand, a technology-driven studio helps produce higher-quality content.
As the data platform team, we need to focus on how to help the company’s developers and data analysts realize their value, and thereby contribute to solving the two problems above.
A brief introduction to the Netflix data platform team and its product, Keystone. Its main function is to help the company instrument all microservices, set up agents, publish events, and collect event data, then store it in different data warehouses such as Hive or Elasticsearch. Finally, it helps users run computation and analysis on the data as it is stored in real time.
- From the user’s perspective, Keystone is a complete, self-contained platform that supports multiple tenants. Users can easily declare and create the pipelines they want through the provided UI.
- From the platform’s perspective, Keystone solves the hard problems of the underlying distributed systems, such as container orchestration and workflow management, all invisible to users.
- From the product’s perspective, there are two main functions: one is to help users move data from edge devices to the data warehouse; the other is to help users compute in real time.
- From the numbers’ perspective, Keystone is essential inside Netflix: virtually every developer who works with data uses it. Keystone has thousands of users across the company, and 100 Kafka clusters support about 10 PB of data per day.
Keystone’s overall architecture has two layers. The bottom layer uses Kafka and Flink as engines and abstracts away all the distributed-systems concerns, which are invisible to users; the whole application is built on top of it. The service layer provides abstract services, and the UI stays simple for users, who do not need to care about the underlying implementation.
Here is a brief history of the Keystone product over the past four or five years. The original motivation was to collect data from all devices and store it in the data warehouse, using Kafka. Data movement itself is easy to solve; in essence it is just a concurrency problem.
Later, users raised a new requirement: simple data-processing operations, such as filter, and a very general operation, project. Keystone launched corresponding features for this demand.
After a while, users said they wanted to do more complex ETL, such as streaming joins. The product therefore decided to expose the lower-level API to users while abstracting away the underlying distributed-systems concerns, so users could focus on the upper layer.
The introduction of product features centers on two “superheroes” at Netflix, Elliot and Charlie. Elliot is a data scientist from the data science engineering organization; his need is to find meaningful patterns in very large data sets to help improve the user experience. Charlie is an application developer from the studio; his goal is to help the developers around him produce higher-quality content by building a series of applications.
Both are very important to the product. Elliot’s analysis drives better recommendations and personalization, ultimately improving the user experience, while Charlie’s work helps the surrounding developers be more efficient.
Recommendation & Personalization
As a data scientist, Elliot needs a simple, easy-to-use real-time ETL platform. He does not want to write very complex code, and at the same time he needs the whole pipeline to have low latency. His work and related needs are as follows:
- Recommendation and personalization. Here, the same video can be presented to different users in different forms according to their personal characteristics. Videos are arranged in multiple rows, each row with its own category, and the rows change according to personal preferences. In addition, each title has an artwork, and users in different countries and regions may prefer different artwork, so the artwork shown to each user is also computed and customized algorithmically.
- A/B testing. Netflix offers non-members 28 days of free viewing, on the belief that users who see videos that suit them are more likely to subscribe. However, an A/B test takes 28 days to complete, and Elliot may make mistakes in a test. What he cares about is finding problems early, without waiting 28 days for the test to end.
When a device plays Netflix, it interacts with the gateway via requests, and the gateway distributes those requests to back-end microservices. Operations such as play, pause, fast forward, and rewind are each handled by different microservices, so the corresponding data needs to be collected for further processing.
For the Keystone platform team, this means collecting and storing the data generated by the different microservices; Elliot then needs to integrate the different data sets to answer his questions.
Why stream processing? There are four main reasons: real-time reporting, real-time alerting, faster training of machine learning models, and resource efficiency. The latter two matter more for Elliot’s work, and resource efficiency deserves particular emphasis. For the 28-day A/B tests, the current practice is to run a batch job every day that reprocesses the previous 27 days along with the new day, which involves a great deal of repeated work; stream processing can improve overall efficiency.
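The efficiency argument can be made concrete with a small sketch. A daily batch job re-reads all 28 days on every run, while an incremental approach only accounts for the day entering and the day leaving the window. This is an illustrative sketch, not Keystone code; the class and its interface are assumptions.

```python
from collections import deque

class SlidingWindowAggregator:
    """Incrementally maintains a rolling sum of a daily metric over a
    fixed window, instead of reprocessing every day in the window."""

    def __init__(self, window_days=28):
        self.window = deque()          # (day, value) pairs inside the window
        self.window_days = window_days
        self.total = 0.0

    def add_day(self, day, value):
        self.window.append((day, value))
        self.total += value
        # Evict days that have fallen out of the window.
        while self.window and self.window[0][0] <= day - self.window_days:
            _, old = self.window.popleft()
            self.total -= old
        return self.total


agg = SlidingWindowAggregator(window_days=3)
print(agg.add_day(1, 10))  # 10.0
print(agg.add_day(2, 20))  # 30.0
print(agg.add_day(3, 30))  # 60.0
print(agg.add_day(4, 40))  # 90.0 (day 1 evicted)
```

Each new day costs O(1) amortized work, versus re-reading the whole window in the batch formulation.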
Keystone provides users with a command-line tool. The tool asks a few simple questions at the start, such as which repository to use. Once the user answers, it generates a template and the user can start developing with it. The product also provides a set of simple SDKs, which currently support Hive, Iceberg, Kafka, and Elasticsearch.
Iceberg deserves emphasis: it is a table format led by Netflix, planned to replace Hive in the future, and it provides many features that help users optimize. Keystone gives users a simple API that generates sources and sinks directly.
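To make the idea of a thin SDK over platform-managed connectors concrete, here is a hypothetical sketch. The class and function names are illustrative assumptions, not Keystone's actual API.

```python
# Hypothetical facade: a user pipeline only declares what it reads and
# writes; the platform would resolve connection details, schemas, and
# credentials behind the scenes.

class Source:
    def __init__(self, kind, name):
        self.kind, self.name = kind, name

class Sink:
    def __init__(self, kind, name):
        self.kind, self.name = kind, name

# Connector types the article says the SDKs currently support.
SUPPORTED = {"hive", "iceberg", "kafka", "elasticsearch"}

def source(kind, name):
    if kind not in SUPPORTED:
        raise ValueError(f"unsupported source type: {kind}")
    return Source(kind, name)

def sink(kind, name):
    if kind not in SUPPORTED:
        raise ValueError(f"unsupported sink type: {kind}")
    return Sink(kind, name)

src = source("kafka", "playback-events")
dst = sink("iceberg", "warehouse.playback_summary")
print(src.kind, "->", dst.kind)  # kafka -> iceberg
```

The point of such a facade is that swapping, say, a Hive sink for an Iceberg sink is a one-line declaration change rather than a rewrite of connector plumbing.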
After finishing this work, Elliot can commit his code to the repository. A CI/CD pipeline starts automatically in the background and packages all source code and artifacts into a Docker image to guarantee version consistency. With one click, Elliot can deploy to the production environment.
The product solves the hard distributed-systems problems in the background, such as container orchestration; it is currently resource-based, with plans to move toward Kubernetes in the future. When a job package is deployed, a JobManager cluster and a TaskManager cluster are deployed with it, so each job is completely isolated from the user’s point of view.
The product provides default configuration options, and also lets users modify and override configuration in the platform UI; a redeployment takes effect directly without rewriting code. For example, during stream processing Elliot may need to read data from different topics: when problems occur, he may need to operate in Kafka or in the data warehouse, which requires switching the source between topics without changing code. The UI provided by the platform makes this very convenient. In addition, at deployment time the platform helps users choose how many resources a job needs.
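The layered configuration described above can be sketched as platform defaults merged with user overrides from the UI, so that a redeploy picks up new settings without a code change. The keys below are illustrative assumptions, not Keystone's real configuration schema.

```python
# Platform-supplied defaults (illustrative keys).
DEFAULTS = {
    "parallelism": 2,
    "checkpoint_interval_ms": 60_000,
    "source_topic": "playback-events",
}

def effective_config(user_overrides):
    """User-supplied values win over platform defaults."""
    merged = dict(DEFAULTS)
    merged.update(user_overrides)
    return merged

# Elliot repoints the job at a replay topic from the UI; no code edit.
cfg = effective_config({"source_topic": "playback-events-replay"})
print(cfg["source_topic"])   # playback-events-replay
print(cfg["parallelism"])    # 2 (default retained)
```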
In moving from batch to stream processing, many users already have artifacts they need, such as schemas, so the platform also helps them integrate these artifacts easily.
Many users write ETL projects on the platform, and as their number grows, scalability becomes especially important. The platform therefore adopts a series of reusable patterns; specifically, three are in use: the extractor pattern, the join pattern, and the enrichment pattern.
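A minimal sketch of composing such patterns, assuming simple list-based operators; the pattern names come from the talk, but the interfaces here are hypothetical (the join pattern is omitted for brevity).

```python
def extractor(field):
    """Extractor pattern: pull one field out of each raw event."""
    def op(events):
        return [e[field] for e in events]
    return op

def enrichment(lookup):
    """Enrichment pattern: attach reference data to each key."""
    def op(keys):
        return [{"key": k, "extra": lookup.get(k)} for k in keys]
    return op

def pipeline(*ops):
    """Chain pattern instances into one runnable pipeline."""
    def run(events):
        for op in ops:
            events = op(events)
        return events
    return run

titles = {"m1": "Stranger Things", "m2": "The Crown"}
run = pipeline(extractor("movie_id"), enrichment(titles))
print(run([{"movie_id": "m1"}, {"movie_id": "m2"}]))
```

Because each pattern is a self-contained operator, new pipelines are assembled from tested building blocks instead of rewritten from scratch, which is what makes the approach scale with the user base.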
Briefly, content production includes forecasting the cost of a production, developing the program, closing the deal, shooting the video, post-production, releasing the video, and financial reporting.
Charlie works in the studio department, which develops a series of applications to support content production. Each application is developed and deployed on a microservice architecture, and each microservice has its own responsibility: for example, one microservice manages movie titles, while another manages deals and contracts.
With so many microservice applications, Charlie faces the challenge of joining data from different places for real-time search, such as finding the actors in a certain movie. In addition, the data grows every day, making it difficult to guarantee the consistency of data updated in real time. This is fundamentally a consequence of distributed microservice systems: different microservices may use different databases, which adds complexity to guaranteeing data consistency. There are three common solutions to this problem:
- Dual writes: when a developer knows data must go into the main database and also into another database, they can simply write it twice. However, this is error-prone: any failure is likely to cause data inconsistency;
- Change data table: this requires the database to support transactions. Whatever operation is applied to the database, the corresponding change is appended within the same transaction and stored in a separate table. A consumer can then query the change table, pick up the changes, and synchronize them to other data stores;
- Distributed transactions: more complex to implement in a multi-database environment.
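The change-data-table approach can be sketched with SQLite: the entity write and the change record are committed in one transaction, so a downstream syncer never sees one without the other. Table and column names here are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id TEXT PRIMARY KEY, title TEXT)")
conn.execute("""CREATE TABLE movie_changes (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    movie_id TEXT, op TEXT)""")

def upsert_movie(movie_id, title):
    # `with conn` wraps both statements in a single transaction:
    # either both rows are committed or neither is.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO movies VALUES (?, ?)", (movie_id, title))
        conn.execute(
            "INSERT INTO movie_changes (movie_id, op) VALUES (?, 'upsert')",
            (movie_id,))

upsert_movie("m1", "Stranger Things")
upsert_movie("m2", "The Crown")

# A polling or CDC job reads the change table in sequence order.
changes = conn.execute(
    "SELECT seq, movie_id, op FROM movie_changes ORDER BY seq").fetchall()
print(changes)  # [(1, 'm1', 'upsert'), (2, 'm2', 'upsert')]
```

The monotonically increasing `seq` column is what gives downstream consumers a total order to replay, which dual writes cannot provide.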
One of Charlie’s requirements is to copy all movies from the movie datastore into a movie search index backed by Elasticsearch. Data is pulled and copied mainly through a polling system, and the change-data-table method above is used to ensure consistency.
The disadvantage of this scheme is that it only supports periodic fetching. In addition, the polling system is directly coupled to the data source: if the schema of the movie datastore changes, the polling system has to be modified. The architecture was therefore later improved by introducing an event-driven mechanism that reads all committed transactions from the database and hands them to the next job via stream processing. To generalize the solution, CDC (change data capture) support was implemented on the source side for databases commonly used at Netflix, including MySQL, PostgreSQL, and Cassandra, all processed through the Keystone pipeline.
Challenges and Solutions
Here are the challenges of the above solutions and how they were addressed:
- Ordering Semantics
For change data events, event ordering must be guaranteed. For example, if an event stream contains create, update, and delete operations, consumers must receive them in strict order. One solution is to enforce this through Kafka; another is to ensure the captured events match the actual order in which the data was read from the database in the distributed system. In the latter approach, the captured change events may contain duplicates and arrive out of order, so Flink deduplicates and reorders them.
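The deduplicate-and-reorder step can be sketched as follows, keyed on a per-record sequence number. This is a simplified illustration of the idea, not the Flink job itself, and the field names are assumptions.

```python
def dedupe_and_order(events):
    """Drop duplicate sequence numbers, then emit in sequence order."""
    seen = {}
    for e in events:
        seen.setdefault(e["seq"], e)   # keep the first copy of each seq
    return [seen[s] for s in sorted(seen)]

# Captured CDC events: duplicated and out of order, as the talk describes.
raw = [
    {"seq": 2, "op": "update"},
    {"seq": 1, "op": "create"},
    {"seq": 2, "op": "update"},   # duplicate delivery
    {"seq": 3, "op": "delete"},
]
print([e["op"] for e in dedupe_and_order(raw)])
# ['create', 'update', 'delete']
```

In a real streaming job this buffering would be bounded by watermarks or state TTLs rather than materializing the whole stream, but the create-before-update-before-delete guarantee is the same.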
- Processing Contracts
When writing stream-processing jobs, the specifics of the schema are often unknown, so a contract needs to be defined on the message, including the wire format, with schema-related information defined at different levels such as infrastructure and platform. The purpose of processor contracts is to help users combine metadata from different processors and minimize duplicated code.
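One way to picture a processor contract: each processor declares the wire format it speaks and the fields it requires and produces, so the platform can check compatibility before wiring processors together. All names below are illustrative assumptions, not Keystone's real contract model.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessorContract:
    name: str
    wire_format: str                      # e.g. "avro", "json"
    required_fields: set = field(default_factory=set)
    produced_fields: set = field(default_factory=set)

def compatible(upstream, downstream):
    """Downstream can follow upstream if wire formats match and every
    field it requires is produced upstream."""
    return (upstream.wire_format == downstream.wire_format
            and downstream.required_fields <= upstream.produced_fields)

db_connector = ProcessorContract(
    "deal-db-connector", "avro", set(), {"deal_id", "status"})
deal_filter = ProcessorContract(
    "new-deal-filter", "avro", {"status"}, {"deal_id", "status"})

print(compatible(db_connector, deal_filter))  # True
```

Checking contracts at composition time is what lets users snap processors together without rewriting glue code for each pairing.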
Take a concrete case: Charlie wants to be notified when a new deal appears. By combining different related components, such as a DB connector and a filter, through user-defined contracts, the platform realizes an open, composable streaming data platform.
Most ETL products seen in the past target data engineers or data scientists, but experience shows the whole ETL process, extract, transform, and load, can be used much more widely. The earliest Keystone was simple and easy to use but not very flexible; later development improved flexibility, but complexity increased accordingly. The team therefore plans to optimize further on the current basis and deliver an open, collaborative, composable, and configurable ETL engineering platform that helps users solve problems in a very short time.
About the author:
Xu Zhenzhong is a Netflix software engineer working on the infrastructure of Netflix’s highly scalable and resilient streaming data platform. He is keen on researching and sharing anything interesting about the fundamentals of real-time data systems and distributed systems.