This article is shared by Ju Dasheng, a researcher at Meituan and head of its real-time computing team. It mainly introduces the practice of applying Flink to incremental production in Meituan's data warehouse. The contents include:
- Data warehouse incremental production
- Streaming data integration
- Streaming data processing
- Application of streaming OLAP
- Future planning
1、 Data warehouse incremental production
1. Meituan data warehouse structure
Let's first introduce the architecture of Meituan's data warehouse and incremental production. As shown in the figure below, this is a simplified view of Meituan's data warehouse architecture, which I describe as "three horizontals and four verticals". The first horizontal is metadata and lineage, which run through the whole pipeline of data integration, data processing, data consumption, and data application. Another horizontal across the whole pipeline is data security, including the authentication system, the permission system, and the overall audit system. The four verticals follow the flow of data, dividing processing into four stages: data integration, data processing, data consumption, and data application.
In the data integration stage, we have corresponding integration systems for the company's internal data, such as user behavior data, log data, DB data, and file data, which unify the data into our processing storage, such as Kafka.
The data processing stage is divided into a stream processing link, a batch processing link, and a data warehouse platform (the Vientiane platform) built on top of them. The produced data is imported into consumption storage through DataLink and finally presented in different forms through applications.
At present, Flink is widely used at Meituan, including for importing data from Kafka to Hive, real-time processing, and data export. Today's sharing focuses on these aspects.
2. Flink at Meituan
Currently, Meituan's Flink deployment has about 6,000 physical machines supporting about 30,000 jobs. We consume about 50,000 Kafka topics, with peak traffic of 180 million messages per second.
3. Flink application scenarios at Meituan
The main application scenarios of Flink at Meituan fall into four parts.
- First, real-time data warehouse, operational analysis, and real-time marketing.
- Second, recommendation and search.
- Third, risk control and system monitoring.
- Fourth, safety audit.
4. Real-time data warehouse vs. incremental production
Next I will introduce the concept of incremental production. The offline data warehouse focuses on three requirements. The first is timeliness. The second is quality, i.e., the quality of the data produced. The third is cost.
Timeliness has two deeper meanings: real-time and punctuality. Not all business needs are real-time; most of the time the requirement is punctuality. For example, business analysts want the previous day's business data ready every morning. Real-time data warehouses mainly address real-time needs, but for punctuality, an enterprise wants to make a trade-off between punctuality and cost. I therefore define incremental production as a trade-off between punctuality and cost for the offline data warehouse. In addition, incremental production also helps with quality: problems in the data can be found in time.
5. Advantages of incremental production
There are two advantages of incremental production.
- It can find data quality problems in time and avoid T+1 data repair.
- It makes full use of resources and moves the data output time earlier.
As shown in the figure below, what we expect is actually the second point: we want to reduce the resources occupied by offline production while at the same time moving its output time one step earlier.
2、 Streaming data integration
1. Data integration v1.0
Let's take a look at the first generation of streaming data integration. When the data volume was small and there were few databases, we simply built a batch transmission system: in the early morning of each day, all the corresponding DB data was loaded into the data warehouse. The advantage of this architecture is that it is very simple and easy to maintain, but its disadvantage is also obvious: for large databases or large tables, loading the data may take 2-3 hours, which greatly delays the output time of the offline data warehouse.
2. Data integration v2.0
Based on this architecture, we added a streaming transmission link. A streaming collection system gathers the corresponding binlogs into Kafka; a Kafka-to-Hive program imports them as raw data, and a merge layer then produces the downstream ODS data.
The advantage of data integration v2.0 is obvious: data transmission happens on day T+0, and only a merge is needed the next day. That merge may take one hour instead of 2-3, and the time saved is considerable.
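The merge step can be pictured as a primary-key deduplication: yesterday's full ODS snapshot is combined with the day's binlog increments, and for each key the latest record wins, with deletes removing the key. A minimal Python sketch of the idea, assuming simplified records with illustrative `id`, `op`, and `amt` fields (not Meituan's actual schema):

```python
def merge_ods(snapshot, binlog_events):
    """Merge a full snapshot with binlog increments by primary key.

    snapshot: dict mapping primary key -> row (yesterday's ODS).
    binlog_events: list of (op, row) tuples in commit order,
                   where op is 'insert', 'update', or 'delete'.
    Returns the new full snapshot (today's ODS).
    """
    result = dict(snapshot)          # start from yesterday's full data
    for op, row in binlog_events:
        key = row["id"]
        if op == "delete":
            result.pop(key, None)    # the key disappears from the snapshot
        else:                        # insert/update: latest record wins
            result[key] = row
    return result

snapshot = {1: {"id": 1, "amt": 10}, 2: {"id": 2, "amt": 20}}
events = [("update", {"id": 1, "amt": 15}),
          ("delete", {"id": 2, "amt": 20}),
          ("insert", {"id": 3, "amt": 30})]
ods = merge_ods(snapshot, events)
```

At scale this replay is exactly the work that takes an hour every morning, which motivates the v3.0 iteration below.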
3. Data integration v3.0
In form, the third-generation data integration architecture is unchanged, since streaming transmission is already in place. The key is the merge process: merging for an hour every morning still wastes time and resources, and it even puts considerable pressure on HDFS. So here we iterated toward the HIDI architecture.
HIDI is implemented internally on top of HDFS.
When we designed HIDI, we had four core requirements. First, it must support reading and writing with the Flink engine. Second, it must support primary-key-based upsert/delete through MOR (merge-on-read) mode. Third, it must manage small files. Fourth, it must support table schema.
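MOR (merge-on-read) defers the cost of upserts: writes append cheaply to a delta log, and the base file is merged with that log only when the table is read. A toy sketch of the idea, not HIDI's actual implementation:

```python
class MorTable:
    """Toy merge-on-read table: cheap appends, merge at read time."""

    def __init__(self, base_rows):
        # Base file: last compacted state, keyed by primary key.
        self.base = {r["id"]: r for r in base_rows}
        self.delta_log = []  # appended (op, row) entries, never rewritten

    def upsert(self, row):
        self.delta_log.append(("upsert", row))   # O(1), no file rewrite

    def delete(self, key):
        self.delta_log.append(("delete", {"id": key}))

    def read(self):
        # Merge on read: replay the delta log over the base file.
        view = dict(self.base)
        for op, row in self.delta_log:
            if op == "delete":
                view.pop(row["id"], None)
            else:
                view[row["id"]] = row
        return sorted(view.values(), key=lambda r: r["id"])

t = MorTable([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
t.upsert({"id": 2, "v": "b2"})
t.delete(1)
t.upsert({"id": 3, "v": "c"})
rows = t.read()
```

In a real table format the delta log is periodically compacted into a new base file, which is also where small-file management comes in.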
Based on these considerations, let's compare HIDI, Hudi, and Iceberg.
HIDI's advantages include:
- Primary-key-based upsert/delete
- Integration with Flink
- Small-file management
Its disadvantage: incremental reads are not supported.
Hudi's advantages include:
- Primary-key-based upsert/delete
- Small-file management
However, its write path is coupled to Spark/DeltaStreamer, and its streaming read/write supports Spark Streaming rather than Flink.
Iceberg's advantage is its integration with Flink. However:
- Its upsert/delete is join-based rather than primary-key-based
- Streaming reads are not supported
5. Streaming data integration effect
As shown in the figure below, there are three stages: data generation, data integration, and ETL production. By pushing data integration into day T+0 as a stream, ETL production can start earlier, which saves cost.
3、 Streaming data processing
1. ETL incremental production
Let's talk about the incremental production process of ETL. Data flows in at the front: after Kafka there is real-time processing with Flink, then Kafka again, then event services and even analysis scenarios. This is our real-time analysis link.
Below it is the batch processing link: data is integrated into HDFS through Flink, processed offline with Spark, and then exported to OLAP applications through Flink. In this architecture, incremental production is the part marked green in the figure below: we expect to replace Spark with a Flink-based incremental production structure.
2. SQL is the first step of ETL incremental production
Such an architecture requires three core capabilities.
- First, Flink's SQL capabilities must be aligned with Spark's.
- Second, our table format layer must support real-time primary-key operations such as upsert/delete.
- Third, our table format must support both full and incremental reads.
The full read is used for queries and data repair, while the incremental read is used for incremental production. SQL is the first step of ETL incremental production; what we share today is our real-time data warehouse platform's support for it based on Flink SQL.
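The difference between full and incremental reads can be sketched over a versioned table: a full read scans the latest snapshot, while an incremental read returns only the changes committed after a given version, which is what incremental production consumes. A toy model (names and structure are illustrative, not a real table format API):

```python
class VersionedTable:
    """Toy table format with snapshot (full) and incremental reads."""

    def __init__(self):
        self.version = 0
        self.commits = []  # one list of (op, row) changes per version

    def commit(self, changes):
        self.commits.append(changes)
        self.version += 1
        return self.version

    def read_full(self):
        # Full read: replay every commit into the current snapshot.
        snap = {}
        for changes in self.commits:
            for op, row in changes:
                if op == "delete":
                    snap.pop(row["id"], None)
                else:
                    snap[row["id"]] = row
        return snap

    def read_incremental(self, since_version):
        # Incremental read: only the changes committed after
        # `since_version`, for downstream incremental production.
        out = []
        for changes in self.commits[since_version:]:
            out.extend(changes)
        return out

t = VersionedTable()
v1 = t.commit([("upsert", {"id": 1, "v": 1})])
v2 = t.commit([("upsert", {"id": 2, "v": 2}), ("delete", {"id": 1})])
full = t.read_full()
inc = t.read_incremental(v1)
```

A downstream job that remembers the last version it consumed can thus process only the delta each run, instead of rescanning the whole table.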
3. Real-time data warehouse model
As shown in the figure below, this is the model of the real-time data warehouse; the industry should be familiar with this kind of model.
4. Real-time data warehouse platform architecture
The platform architecture of the real-time data warehouse is divided into a resource layer, storage layer, engine layer, SQL layer, platform layer, and application layer. Two points are worth emphasizing here.
- The first is UDF support. UDFs are an important way to fill in capabilities that SQL operators lack, so we want UDFs to extend the capabilities of SQL.
- Second, this architecture only supports Flink's streaming mode; we do not offer Flink batch, because we assume all future architectures will be based on streaming, which is consistent with the community's direction.
5. Real-time data warehouse platform Web IDE
This is the Web IDE of our data warehouse platform. In this IDE, we support the SQL modeling process and ETL development.
4、 Application of streaming OLAP
1. Heterogeneous data source synchronization
Let's look at streaming export and the application of OLAP. As shown in the figure below, this is the synchronization diagram for heterogeneous data sources, an area with many open-source products, because data is constantly being exchanged between different storage systems. Our idea is to build a middleware called DataLink, an intermediate platform that abstracts the N-to-N data exchange into an N-to-1 exchange through a common intermediate layer.
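The payoff of the intermediate layer is combinatorial: with a common record format, N sources and M sinks need only N readers plus M writers instead of N×M point-to-point converters. A minimal sketch of such a reader/writer registry (DataLink's real plugin interface is not described in the talk; all names here are hypothetical):

```python
# Common intermediate record: every reader emits it, every writer accepts it.
READERS = {}
WRITERS = {}

def reader(name):
    """Register a function that converts a source into common records."""
    def register(fn):
        READERS[name] = fn
        return fn
    return register

def writer(name):
    """Register a function that writes common records to a destination."""
    def register(fn):
        WRITERS[name] = fn
        return fn
    return register

@reader("mysql")
def read_mysql(source):
    # Pretend `source` is rows already fetched from MySQL.
    return [{"id": r[0], "value": r[1]} for r in source]

@writer("es")
def write_es(records):
    # Pretend this indexes documents into Elasticsearch.
    return [f"es-doc-{r['id']}" for r in records]

def sync(src_type, source, dst_type):
    """Any-to-any sync through the common record format."""
    records = READERS[src_type](source)
    return WRITERS[dst_type](records)

docs = sync("mysql", [(1, "a"), (2, "b")], "es")
```

Adding a new storage system then means writing one reader or one writer, and it immediately composes with every existing counterpart.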
2. Synchronization architecture based on DataX
The first version of heterogeneous data source synchronization was based on DataX. This architecture includes a tool platform layer, a scheduling layer, and an execution layer.
- The tool platform layer is very simple: it faces users and handles configuration of synchronization tasks, scheduling, and operations.
- The scheduling layer is responsible for task scheduling, task state management, and management of the executor machines; much of this work we had to do ourselves.
- In the execution layer, data is synchronized from source to destination through DataX processes and multithreaded tasks.
In this architecture we found two core problems. The first is scalability: open-source DataX uses a single-machine multithreaded model, and when a large amount of data must be transferred, its scalability becomes a serious limitation. The second is in the scheduling layer: we had to manage machines, synchronization state, and synchronization tasks ourselves, which is very tedious, and when a scheduling or execution machine failed, we had to handle disaster recovery separately.
3. Synchronization architecture based on Flink
Based on this, we changed to a Flink-based synchronization architecture. The front remains the tool platform layer. Task scheduling and executor machine management, which sat in the scheduling layer of the original architecture, are now handed over to YARN, freeing us from that work. Second, the scheduling layer's task state management can be delegated directly to the cluster.
The Flink-based DataLink architecture has obvious advantages.
- First, the scalability problem is solved and the architecture is very simple. A synchronization task can now be broken down and spread across TaskManagers in a distributed cluster.
- Second, offline and real-time synchronization tasks are unified under the Flink framework, so the source and sink components used for synchronization can be shared between them, which is a great advantage.
4. Key design of the Flink-based synchronization architecture
Let's look at the key design points of the Flink-based synchronization architecture. Here are four lessons learned.
- First, avoid shuffles across TaskManagers to avoid unnecessary serialization cost.
- Second, design a dirty-data collection bypass and a failure feedback mechanism.
- Third, use Flink's accumulators to design a graceful exit mechanism for batch tasks.
- Fourth, use S3 to manage reader/writer plug-ins and distribute them via hot loading to improve deployment efficiency.
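The third lesson can be illustrated with a toy version of the accumulator trick: each source split reports completion through a shared counter, and the job stops cleanly only when the count reaches the number of splits, instead of being killed mid-write. A simplified single-process model (Flink's real accumulator API is Java-side; this shows only the control logic):

```python
import threading

class FinishAccumulator:
    """Counts finished source splits; the job exits when all report done."""

    def __init__(self, total_splits):
        self.total = total_splits
        self.done = 0
        self.lock = threading.Lock()
        self.all_done = threading.Event()

    def report_finished(self):
        with self.lock:
            self.done += 1
            if self.done == self.total:
                self.all_done.set()  # now safe to stop the whole job

def run_split(acc, rows, out):
    for row in rows:
        out.append(row)              # flush all pending records first
    acc.report_finished()            # only then report completion

out = []
acc = FinishAccumulator(total_splits=3)
threads = [threading.Thread(target=run_split, args=(acc, [i] * 2, out))
           for i in range(3)]
for t in threads:
    t.start()
acc.all_done.wait(timeout=5)         # graceful exit: wait, don't kill
for t in threads:
    t.join()
```

The point of the pattern is ordering: completion is reported after the last record is written, so waiting on the accumulator guarantees no in-flight data is lost at shutdown.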
5. Flink-based OLAP production platform
Based on Flink, we built DataLink as a data export platform, and an OLAP production platform on top of DataLink. Above the underlying engine layer we built a platform layer that manages resources, models, tasks, and permissions, which makes OLAP production very fast.
Here are two screenshots of our OLAP production: one shows model management in OLAP, the other shows task configuration management in OLAP.
5、 Future planning
After these iterations, we now use Flink in data integration, data processing, offline data export, and OLAP production. In the future we hope that stream and batch processing will be unified, and that the data itself will also be unified across stream and batch. Once that happens, both real-time links and incremental links can be processed with Flink, achieving true stream-batch unification.