Author: Liu Dishan
This article is based on the Flink meetup held in Beijing on August 11. The speaker, Liu Dishan, joined Meituan's data platform team in 2015. The team is committed to building an efficient, easy-to-use real-time computing platform and to exploring enterprise-level solutions and unified services for real-time applications across different scenarios.
Current situation and background of Meituan's real-time computing platform
Real-time platform architecture
The figure above shows the overall architecture of the current Meituan real-time computing platform. The bottom layer is the data cache layer: all log data at Meituan is collected into Kafka through a unified log collection system. As the largest data transfer layer, Kafka supports a huge number of Meituan's online businesses, including offline pulls and real-time processing.

Above the data cache layer sits the engine layer. On the left side of this layer are the current real-time computing engines, Storm and Flink. Storm was previously deployed in standalone mode; given its current running environment, Meituan chose the on-YARN mode for Flink. Besides the computing engines, this layer also provides real-time storage to hold intermediate state, computation results, and dimension data; the storage systems currently used include HBase, Redis, and Elasticsearch.

On top of the engine layer are facilities aimed at data developers. Developing real-time data applications poses many problems; for example, programs are harder to debug and tune than ordinary applications. At the platform level, the real-time computing platform Meituan provides can not only host jobs, but also offers tuning diagnosis, monitoring and alarms, real-time data retrieval, and permission management. Beyond the platform itself, Meituan is also building a metadata center, which is a prerequisite for offering SQL in the future. The metadata center is an important part of the real-time streaming system; it can be understood as the brain of the system, storing the schema and metadata of the data.
The top layer of the architecture is the business currently supported by the real-time computing platform, which not only includes real-time query and retrieval of online business logs, but also covers the currently very popular real-time machine learning. Machine learning often involves search and recommendation scenarios, which have two notable characteristics: first, they generate massive real-time data; second, the traffic QPS is quite high. In these cases the real-time computing platform is needed to perform part of the real-time feature extraction and serve search and recommendation. There are also more common scenarios, including real-time feature aggregation, Zebra Watcher (which can be considered a monitoring service), the real-time data warehouse, and so on.
The above is the overall architecture of Meituan's current real-time computing platform.
Real-time platform status
The current status of the Meituan real-time computing platform: the number of jobs has reached nearly 10,000, the cluster contains on the order of a thousand nodes, the daily message volume has reached the trillion level, and the peak message rate can reach tens of millions of messages per second.
Pain points and problems
Before using Flink, Meituan encountered some pain points and problems:
- Accuracy of real-time computing: before investigating Flink, Meituan's large-scale jobs were developed on Storm. Storm's main computing semantics is at-least-once, which poses problems for correctness. Before Trident, Storm processing was stateless; although Storm Trident provides exactly-once semantics with maintained state, it is based on serial batch commits, so processing performance can hit a bottleneck when problems occur. Moreover, Trident is based on micro-batch processing, so it cannot meet services with stricter latency requirements.
- State management in stream processing: state management is a very large class of problems. It affects not only the consistency of computed state, but also the processing performance of real-time computing and the ability to recover from failures. One of Flink's most outstanding advantages is its state management.
- Limited expressiveness of real-time computing: most data development at many companies is still oriented toward offline scenarios, though real-time scenarios have become increasingly popular in recent years. Unlike offline processing, the expressiveness of data processing in real-time scenarios can be limited: for example, a lot of functionality has to be hand-built in order to perform accurate computation over time windows.
- High cost of development and debugging: nearly ten thousand jobs run on a cluster of about a thousand nodes. The distributed processing engine, combined with hand-written code, imposes a high development and debugging cost on data developers, and the subsequent operation and maintenance cost is also relatively high.
Focus areas in exploring Flink
Against the background of the above pain points and problems, Meituan began to explore Flink last year, focusing on the following aspects:
- Exactly-once computing capability
- State management capability
- Window / join / time handling, etc.
Flink's practice at Meituan
Let's take a look at the problems Meituan encountered in production last year and some of the solutions, divided into the following three parts:
Stability practice – resource isolation
1. Considerations for resource isolation: by scenario and by business
- Peak periods and operation/maintenance windows differ;
- Reliability and latency requirements differ;
- Application scenarios differ in importance;
2. Resource isolation strategy:
- YARN nodes are labeled to physically isolate nodes;
- Offline DataNodes are isolated from real-time computing nodes;
Stability practice – Intelligent Scheduling
The purpose of intelligent scheduling is to solve the problem of uneven resource distribution. The common scheduling strategy today is based on CPU and memory, but other problems have surfaced in production. For example, Flink relies on the local disk for local state storage, so disk I/O and disk capacity also need to be considered. NIC traffic matters as well, because traffic patterns differ between businesses, and a poor allocation can saturate one machine's NIC during a traffic peak and affect other services. The expectation is therefore to do some intelligent scheduling. For the time being, scheduling is based on CPU and memory; in the future, better scheduling strategies will incorporate the other dimensions.
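The multi-resource idea above can be sketched as a weighted scoring function over node utilization. This is a minimal illustration, not Meituan's actual scheduler; the resource names and weights are assumptions chosen for the example.

```python
def node_score(node, weights=None):
    """Score a node by its remaining headroom across several resources.

    `node` maps resource name -> utilization in [0.0, 1.0].
    Lower utilization (more headroom) yields a higher score.
    """
    # Weights are illustrative: CPU/memory dominate, disk I/O and NIC
    # traffic are the extra dimensions the text says should be considered.
    weights = weights or {"cpu": 0.35, "memory": 0.35, "disk_io": 0.15, "nic": 0.15}
    return sum(w * (1.0 - node.get(res, 0.0)) for res, w in weights.items())

def pick_node(nodes):
    """Choose the (name, utilization) pair with the most weighted headroom."""
    return max(nodes, key=lambda item: node_score(item[1]))[0]
```

A scheduler built this way naturally avoids the "one NIC is full" failure mode: a node whose `nic` utilization is near 1.0 scores poorly even if its CPU is idle.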
Stability practice – fault tolerance
1. Node / network failure
- Automatic pull-up
Flink differs from Storm here. Storm's approach to exceptions is simple and crude: if an exception occurs, even when the user has not handled it properly in the code, the worker simply restarts and the job continues to execute, with at-least-once semantics guaranteed, so something like a network timeout exception may not have much impact. Flink, by contrast, is very strict about exceptions. We considered that node or network failures will happen, and that the single point of the JobManager could be a bottleneck: if the JobManager fails, the impact on the whole job might be unrecoverable, so we considered HA for it. We also considered failures caused by operational factors. In addition, some user jobs may not have checkpointing enabled; if such a job dies due to a node or network failure, we want the platform layer to apply an automatic pull-up strategy to keep the job running stably.
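A platform-side automatic pull-up policy can be sketched as a bounded restart loop with exponential backoff. This is an assumption-laden illustration of the idea, not the platform's real code; `submit_job`, the retry count, and the delays are all hypothetical.

```python
import time

def pull_up(submit_job, max_restarts=3, base_delay_s=1.0, sleep=time.sleep):
    """(Re)start a job a bounded number of times, backing off between attempts.

    `submit_job` is a callable that raises on failure and returns a handle
    (or status) on success. `sleep` is injectable so tests can skip waiting.
    """
    for attempt in range(max_restarts + 1):
        try:
            return submit_job()
        except Exception:
            if attempt == max_restarts:
                raise  # give up after the restart budget is exhausted
            sleep(base_delay_s * (2 ** attempt))  # exponential backoff
```

Bounding the restarts matters: a job that fails deterministically (e.g. a bad config) should eventually surface an alarm rather than restart forever.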
2. Upstream and downstream fault tolerance
- Flink Kafka 0.8 connector exception retry
Our data source is mainly Kafka, and reading and writing Kafka is an unavoidable part of real-time stream processing. The Kafka cluster is very large, so node failures are routine. On this basis we added fault tolerance for node failures: for example, when a node dies or data is rebalanced, the partition leader switches, and Flink's tolerance for leader switches during reads and writes is not very high. We therefore made optimizations and added retries for specific scenarios and particular exceptions.
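The retry pattern can be sketched as follows: retry only a whitelist of transient, recoverable error types (such as a leader switch during rebalance) and let everything else fail fast. The exception classes here are placeholders standing in for the real Kafka client errors, not actual library types.

```python
class LeaderNotAvailable(Exception):
    """Placeholder for the transient 'leader not available' broker error."""

# Only errors known to be transient are worth retrying; anything else
# should propagate so the job fails visibly instead of spinning.
RECOVERABLE = (LeaderNotAvailable, TimeoutError)

def with_retries(fetch, retries=5):
    """Retry `fetch` on recoverable errors; let fatal errors propagate."""
    for attempt in range(retries):
        try:
            return fetch()
        except RECOVERABLE:
            if attempt == retries - 1:
                raise  # exhausted the retry budget
```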
3. Disaster recovery
- Multi computer room
- Hot standby
Disaster recovery may not get much consideration: for example, what happens if all nodes in one machine room go down or become unreachable? Although it is a low-probability event, it does happen. So we now also consider multi-machine-room deployment, including hot standby for Kafka.
Flink platform – job management
In practice, to solve job management problems and reduce users' development costs, we have done some platform work. The figure below shows the job submission interface, including job configuration, job lifecycle management, alarm configuration, and latency display, all integrated into the real-time computing platform.
Flink platform – monitoring and alarms
We have also done some work on monitoring. Real-time jobs have higher monitoring requirements, because when a job is delayed the impact on the business is greater. So we built latency alarms, as well as alarms on job status, such as job liveness and job running state. In the future we will add custom metrics alarms: configurable alarms based on the content a job processes.
Flink platform – tuning diagnostics
- The real-time computing engine provides a unified logging and metrics solution
- Conditionally filtered log retrieval for the business
- Metric queries over custom time spans for the business
- Configurable alarms for the business based on logs and metrics
As mentioned earlier, tuning and diagnosis is a relatively difficult pain point when developing real-time jobs: it is hard for users to view distributed logs, so a unified solution is provided. This solution targets both logs and metrics. Logs and metrics are reported at the engine level, and the raw data is collected into Kafka through the unified log collection system. Kafka has two downstreams: on one side, log data is synchronized to Elasticsearch so it enters the log center and can be searched; on the other side, metrics flow through aggregation processing and are written to OpenTSDB. The aggregated data then serves queries: metrics display on one hand, and related alarms on the other.
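The two-branch fanout described above can be sketched in a few lines: raw records land in Kafka, then one path indexes logs for search while the other pre-aggregates metrics per time window before flushing to a time-series store. The record shape and the sinks here are stand-ins, not real Elasticsearch/OpenTSDB clients.

```python
def dispatch(record, log_sink, metric_agg, window_s=60):
    """Route a collected record to the log path or the metrics path.

    `log_sink` is any callable (e.g. an indexer); `metric_agg` is a dict
    accumulating (metric name, window start) -> summed value.
    """
    if record["type"] == "log":
        log_sink(record)  # log path: index for retrieval (e.g. into ES)
    elif record["type"] == "metric":
        # metrics path: pre-aggregate per (name, window) before the TSDB write
        key = (record["name"], record["ts"] - record["ts"] % window_s)
        metric_agg[key] = metric_agg.get(key, 0) + record["value"]
```

Pre-aggregating before the time-series write is the design choice that keeps write volume bounded even when the raw metric stream is very large.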
The figure below is a query page for a job that supports metric queries across days. Comparing vertically makes it easy to find the cause of a problem at a certain point in time, such as a delay, which helps users diagnose problems in their jobs. Besides the job's running status, basic information about the nodes is also collected for horizontal comparison.
The figure below shows the current log query. Since the application ID may change each time a job is restarted after a failure, all logs are collected under the job's unique primary key, the job name. From creation to the current run, users can query logs across applications.
To accommodate these two types of MQ, different things were done. For online MQ, the expectation is synchronized multi-consumer consumption, to avoid affecting the online business. For production-type Kafka, that is, offline Kafka, address shielding was done, along with basic configuration, permission management, and metric collection.
The application of Flink at Meituan
Here are two real use cases of Flink at Meituan. The first is Petra. Petra is a real-time indicator aggregation system, and a unified solution for the company. Its main business scenario is statistics based on business time and the calculation of real-time indicators, with low latency requirements. Because it targets general business needs, each business may have different dimensions; most include application, channel, and machine room, plus business-specific dimensions, and these dimensions can be numerous. Businesses also need composite indicator calculations: the most common example is transaction success rate, which may be computed as the number of successful payments divided by the number of orders. Unified indicator aggregation may also serve a monitoring system, whose demand is that indicator aggregation produce results as promptly and accurately as possible, so that the downstream system can truly monitor current conditions. The picture on the right is an example of the metrics display; the indicators are similar to those just mentioned, i.e. aggregation results of indicators across different business dimensions.
Petra real-time indicator aggregation
1. Business scenario:
- Based on business time (event time)
- Multiple business dimensions: such as application, channel, machine room, etc.
- Composite indicator calculation: for example, transaction success rate = number of successful payments / number of orders
- Low latency: second-level result output
2. Exactly-once accuracy guarantee
- Flink checkpoint mechanism
3. Data skew in dimension calculation
- Hotspot key hashing
4. Tolerance for late-arriving data
- Trade-off between window settings and resources
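The composite-indicator bullet above (transaction success rate = successful payments / orders) can be illustrated with a minimal windowed aggregation. This is a toy sketch assuming a simplified event shape of `(event_time, dimension, kind)`; a real Petra job would do this with Flink event-time windows rather than in-memory dicts.

```python
from collections import defaultdict

def success_rate(events, window_s=60):
    """Compute success rate per (event-time window, dimension).

    `events` yields (event_time_s, dimension, kind), where kind is
    "order" or "pay_success". Returns {(window_start, dimension): rate}.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [orders, successful pays]
    for ts, dim, kind in events:
        key = (ts - ts % window_s, dim)  # tumbling window on event time
        if kind == "order":
            counts[key][0] += 1
        elif kind == "pay_success":
            counts[key][1] += 1
    return {k: (pays / orders if orders else 0.0)
            for k, (orders, pays) in counts.items()}
```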
When building the real-time indicator aggregation system with Flink, the focus is on these aspects. The first is accurate calculation, using Flink's checkpoint mechanism to guarantee no loss and no duplication. Unified metrics flow into a pre-aggregation module that does initial aggregation. Why pre-aggregation followed by full aggregation? It solves a class of problems including hotspot keys: the current solution is to buffer in pre-aggregation so that keys are scattered as much as possible, and then merge in the full-aggregation module. This only solves part of the problem, so performance optimizations, including exploring the performance of state storage, are also being considered.

Then there is tolerance for late-arriving data. As just mentioned, indicator aggregation may include composite indicators, so the data making up an indicator may come from different streams, and even within the same stream individual records may arrive late. Association of late data therefore has to be handled: on one side an allowed lateness can be set, and on the other the window length can be set; but in real application scenarios, besides extending time as far as possible, the actual computation cost must also be considered, so trade-offs were made here.

After full aggregation, the results are written back to Kafka, written to OpenTSDB through the data synchronization module, and finally displayed in Grafana. The results are also synchronized, through a synchronization module, to the alarm system for indicator-based alerting.
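The pre-aggregation / full-aggregation pattern for hotspot keys can be sketched as key "salting": a hot key is scattered across several sub-keys so partial sums spread over parallel pre-aggregators, and the full aggregator strips the salt and merges. The round-robin salt and record shape are simplifying assumptions for illustration; a real job would salt by hash or randomly and run the two stages as separate Flink operators.

```python
from collections import Counter

def pre_aggregate(records, fanout=4):
    """Stage 1: partial sums keyed by (key, salt), spreading hot keys."""
    partial = Counter()
    for i, (key, value) in enumerate(records):
        # Round-robin salt (deterministic stand-in for a random/hash salt):
        # a single hot key is split into up to `fanout` sub-keys.
        partial[(key, i % fanout)] += value
    return partial

def full_aggregate(partial):
    """Stage 2: strip the salt and merge the partial sums per real key."""
    total = Counter()
    for (key, _salt), value in partial.items():
        total[key] += value
    return total
```

The sum is unchanged by the two-stage split; what changes is that no single pre-aggregator instance has to absorb the entire hot key's traffic.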
The figure below is a display of the Petra product. Commonly used operators and dimension configurations are currently defined, allowing users to obtain the display and aggregation results of the desired indicators directly through configuration. We are also exploring SQL-based capabilities for Petra, because many users are more accustomed to writing SQL to complete such statistics; relying on Flink's support for SQL and the Table API, we will also explore SQL scenarios.
MLX machine learning platform
The second kind of application is the machine learning scenario, which may depend on both offline feature data and real-time feature data. One path is feature extraction based on existing offline scenarios: after batch processing, data flows into the offline cluster. The other is the near-line mode, whose data comes from the existing unified log collection system; after Flink processing, including stream association and feature extraction, comes model training, with data transferred to the training cluster. The training cluster produces features, as well as delta features, which are finally synchronized to an online feature service. This is a common scenario; at present the main applications include search, recommendation, and other businesses.
In the future, more is expected in three aspects. The first is state management. One part is unified state management, such as unified management at the SQL level, with unified configuration to help users choose expected rollback points. The other part is performance optimization for large state: for example, when doing dual-stream association over traffic data, there are performance bottlenecks; comparing memory-based state processing against RocksDB-based state processing shows large performance differences, so we hope to do more optimization on the RocksDB backend to improve job processing performance.

The second aspect is SQL. SQL is probably a direction every company is pursuing at present. There were earlier explorations of SQL, including providing SQL expression on top of Storm, but the earlier semantics had deficiencies, so we hope to solve these problems based on Flink, along with optimizations of SQL configuration, including SQL concurrency and SQL queries, so that SQL can truly be applied to the production environment.
On the other hand, we are also exploring new scenarios. Besides stream processing, we also expect to merge in data from offline scenarios and provide more services to the business through a unified SQL API, covering both stream and batch processing.
For more information, please visit the Apache Flink Chinese community website