Introduction: This article was shared by Ju Dasheng, a researcher at Meituan and head of its real-time computing team. It introduces how Flink supports incremental production in Meituan's data warehouse. The contents include: 1. Incremental data warehouse production; 2. Streaming data integration; 3. Streaming data processing; 4. Streaming OLAP applications; 5. Future plans.
1、 Incremental data warehouse production
1. Meituan's data warehouse architecture
First, let me introduce the architecture of Meituan's data warehouse and incremental production. As shown in the figure below, this is a simplified view of Meituan's data warehouse, which I describe as "three horizontals and four verticals". The three horizontals run through the entire pipeline: the first is metadata and lineage across the whole link; the second is the end-to-end process link of data integration, data processing, data consumption and data application; and the third, also spanning the whole link, is data security, including the domain-scoped authentication system, the permission system and overall auditing. Following the flow of data, we divide processing into four stages: data integration, data processing, data consumption and data application.
In the data integration stage, we have dedicated integration systems for the company's internal data, such as user behavior data, log data, DB data and file data, unifying them into our processing storage, such as Kafka.
The data processing stage is divided into a streaming link, a batch link, and the data warehouse working platform (the Wanxiang platform) built on top of them. The produced data is exported into consumption storage through DataLink and finally presented in different forms by applications.
At present, Flink is widely used across this pipeline, including importing data from Kafka to Hive, real-time processing, and data export. Today's sharing focuses on these aspects.
2. Overview of Flink applications at Meituan
Meituan's Flink deployment currently runs on about 6,000 physical machines, supporting roughly 30,000 jobs. We consume about 50,000 Kafka topics, with a daily peak traffic of 180 million messages per second.
3. Meituan's Flink application scenarios
The main Flink application scenarios at Meituan cover four areas.
- First, the real-time data warehouse, business analysis, operations analysis and real-time marketing.
- Second, recommendation and search.
- Third, risk control and system monitoring.
- Fourth, security auditing.
4. Real-time data warehouse vs. incremental data warehouse production
Next, I want to introduce the concept of incremental production. An offline data warehouse has three core requirements. The first is timeliness. The second is quality, the quality of the output data. The third is cost.
Timeliness has two deeper meanings: real-time and punctuality. Not all business requirements are real-time; many of our needs are about being on time. For example, business analysis usually just needs yesterday's figures delivered every morning. A real-time data warehouse mainly addresses real-time needs; for punctuality, as an enterprise, I prefer to make a trade-off between punctuality and cost. Therefore, I define incremental data warehouse production as a trade-off between punctuality and cost for the offline data warehouse. Another strength of incremental production is quality: problems can be found in time.
5. Advantages of incremental data warehouse production
Incremental data warehouse production has two advantages.
- It finds data quality problems in time, avoiding T+1 data repairs.
- It makes full use of resources and brings the data output time forward.
As shown in the figure below, what we expect is actually the second picture: we want to reduce the resources occupied by offline production while moving its output time one step ahead.
2、 Streaming data integration
1. Data integration v1.0
Let's look at the first generation of streaming data integration. When the data volume and the number of databases were small, we simply built a batch transmission system: in the early morning of each day, all the corresponding DB data was loaded into the data warehouse. The advantage of this architecture is that it is very simple and easy to maintain, but its drawback is also very obvious: for large databases or large volumes of data, loading could take 2-3 hours, which greatly delayed the output of the offline data warehouse.
2. Data integration v2.0
Based on this architecture, we added a streaming link: a streaming collection system gathers the corresponding binlogs into Kafka, a Kafka-to-Hive program imports them as raw data, and a merge layer then produces the downstream ODS data.
The advantage of data integration v2.0 is very obvious: data transmission now happens within day T+0, and only a merge is needed the next day. This may reduce the time from 2-3 hours to one hour, a very considerable saving.
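The logic of that merge step can be sketched roughly as follows. This is a simplified Python illustration, not Meituan's actual implementation: the previous day's full snapshot is combined with the day's binlog changes by primary key, with the latest change winning.

```python
def merge_snapshot_with_binlog(snapshot, binlog_events):
    """Build the new ODS table by applying a day's binlog changes,
    in commit order, on top of the previous full snapshot.

    snapshot:      dict mapping primary key -> row
    binlog_events: iterable of (op, key, row), where op is one of
                   'insert', 'update', 'delete'
    """
    merged = dict(snapshot)                # start from the full snapshot
    for op, key, row in binlog_events:
        if op in ("insert", "update"):
            merged[key] = row              # latest row image wins
        elif op == "delete":
            merged.pop(key, None)          # drop deleted rows
    return merged

snapshot = {1: {"id": 1, "amount": 10}, 2: {"id": 2, "amount": 20}}
binlog = [
    ("update", 1, {"id": 1, "amount": 15}),
    ("insert", 3, {"id": 3, "amount": 30}),
    ("delete", 2, None),
]
ods = merge_snapshot_with_binlog(snapshot, binlog)
```

The point of the trade-off described above is that the binlog transmission happens continuously during day T, so only this comparatively cheap merge remains for the next morning.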
3. Data integration v3.0
Formally, nothing changed ahead of the third generation of the data integration architecture, since transmission was already streaming. The key is the later merge step: spending an hour every morning on the merge still wastes time and resources, and even puts great pressure on HDFS. So here we iterated toward the HIDI architecture.
HIDI is something we built internally on top of HDFS.
We designed HIDI around four core requirements. First, it supports reading and writing from the Flink engine. Second, it supports primary-key-based upsert/delete through MOR (merge-on-read) mode. Third, small-file management. Fourth, support for table schema.
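MOR (merge-on-read) means upserts and deletes are appended to a delta log and only merged with the base file when the table is read, with periodic compaction folding the log back into the base. A toy sketch of the idea (class and field names are invented, not HIDI's real interface):

```python
class MorTable:
    """Toy merge-on-read table: writes append to a delta log,
    reads merge the base file with the log on the fly."""

    def __init__(self, base_rows):
        self.base = {r["id"]: r for r in base_rows}  # compacted base file
        self.log = []                                # ordered delta log

    def upsert(self, row):
        self.log.append(("upsert", row["id"], row))  # cheap append-only write

    def delete(self, key):
        self.log.append(("delete", key, None))

    def read(self):
        """Merge on read: replay the delta log over the base file."""
        view = dict(self.base)
        for op, key, row in self.log:
            if op == "upsert":
                view[key] = row
            else:
                view.pop(key, None)
        return sorted(view.values(), key=lambda r: r["id"])

    def compact(self):
        """Fold the log into a new base file (small-file management)."""
        self.base = {r["id"]: r for r in self.read()}
        self.log.clear()

table = MorTable([{"id": 1, "v": "a"}])
table.upsert({"id": 2, "v": "b"})
table.delete(1)
rows = table.read()
```

The design choice is the classic write-fast/read-slower trade: streaming writers never rewrite files, and the read-time merge cost is bounded by compacting periodically.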
Based on these requirements, let's compare HIDI, Hudi and Iceberg.
HIDI's advantages include:
- Primary-key-based upsert/delete
- Integration with Flink
- Small-file management
Its disadvantage: incremental reads are not supported.
Hudi's advantages include:
- Primary-key-based upsert/delete
- Small-file management
Its disadvantages: writes were limited to Spark/DeltaStreamer, and streaming read/write was limited to Spark Streaming.
Iceberg's advantage: integration with Flink. Its disadvantages:
- Upsert/delete is join-based rather than primary-key-based
- Streaming reads are not supported
4. Effect of streaming data integration
As shown in the figure below, we have three stages: data generation, data integration and ETL production. By bringing streaming data integration into T+0, ETL production can start earlier and our costs can be reduced.
3、 Streaming data processing
1. Incremental ETL production
Let's talk about the incremental ETL production process. Data arrives from the front into Kafka, is processed in real time by Flink, goes back into Kafka, and on to the event service and even analysis scenarios; this is our real-time analysis link.
Below it is the batch link: data is integrated into HDFS through Flink, processed offline with Spark, and then exported to OLAP applications through Flink. In this architecture, incremental production is the part marked green in the figure below: we expect to replace Spark with an incremental production architecture based on Flink.
2. SQL-ization is the first step of incremental ETL production
Such an architecture requires three core capabilities.
- First, Flink's SQL capability must be aligned with Spark's.
- Second, our table format layer must support real-time operations such as upsert/delete.
- Third, our table format must support both full and incremental reads.
Full reads are used for querying and repairing data, while incremental reads are used for incremental production. SQL-ization is the first step of incremental ETL production, and what we share today is the Flink SQL-based support of our real-time data warehouse platform.
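The difference between the two read modes can be illustrated with a toy versioned table (hypothetical names, not the real table-format API): a full scan replays every commit to rebuild the table for backfill and repair, while an incremental scan returns only the changes after a given version for incremental production.

```python
class VersionedTable:
    """Toy changelog-backed table supporting full and incremental reads."""

    def __init__(self):
        self.commits = []  # list of (version, changes)

    def commit(self, changes):
        """changes: list of (key, row) pairs; row=None means delete."""
        version = len(self.commits) + 1
        self.commits.append((version, changes))
        return version

    def full_scan(self):
        """Replay every commit: used for querying and repairing data."""
        state = {}
        for _, changes in self.commits:
            for key, row in changes:
                if row is None:
                    state.pop(key, None)
                else:
                    state[key] = row
        return state

    def incremental_scan(self, after_version):
        """Return only the changes after a version: incremental production."""
        return [change
                for version, changes in self.commits
                if version > after_version
                for change in changes]

table = VersionedTable()
v1 = table.commit([(1, "a"), (2, "b")])
v2 = table.commit([(2, "b2"), (3, "c")])
```

An incremental ETL job would checkpoint the last version it consumed and call the incremental scan from there, while a repair job would fall back to the full scan.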
3. Real-time data warehouse model
As shown in the figure below, this is the model of a real-time data warehouse; the industry should be familiar with this kind of model.
4. Real-time data warehouse platform architecture
The platform architecture of the real-time data warehouse is divided into a resource layer, storage layer, engine layer, SQL layer, platform layer and application layer. Two points deserve emphasis here.
- The first is support for UDFs. UDFs are a very important way to complement operator capabilities, so we hope the UDFs built here can extend the platform's SQL capabilities.
- Second, this architecture only supports Flink's streaming capability; we have not built Flink batch processing, because we envisage that future architectures will all be stream-based, which is also consistent with the direction of the community.
5. Real-time data warehouse platform Web IDE
This is the Web IDE of our data warehouse platform. In this IDE, we support an SQL-based modeling process and ETL development capabilities.
4、 Streaming OLAP applications
1. Heterogeneous data source synchronization
Let's look at streaming export and OLAP applications. The figure below shows synchronization between heterogeneous data sources; there are many open-source products in this area, since data is constantly being exchanged between different storages. Our idea is to build a middleware, or intermediate platform, called DataLink, and abstract the N-to-N data exchange process into an N-to-1 exchange process.
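The N-to-1 abstraction means each storage needs only one reader and one writer against a common intermediate record format, instead of a dedicated pipeline per source/sink pair. A minimal sketch (registry and function names are illustrative, not DataLink's real interface):

```python
READERS, WRITERS = {}, {}   # plug-in registries keyed by storage type

def register_reader(name, fn):
    READERS[name] = fn      # storage payload -> common record list

def register_writer(name, fn):
    WRITERS[name] = fn      # common record list -> storage result

def sync(source, sink, payload):
    """Route any source to any sink through one intermediate record
    format, so N storages need N readers + N writers, not N*N pipelines."""
    records = READERS[source](payload)
    return WRITERS[sink](records)

# Two toy plug-ins: a CSV-like source and a key-value sink.
register_reader("csv", lambda text: [line.split(",") for line in text.splitlines()])
register_writer("kv", lambda records: {r[0]: r[1] for r in records})

result = sync("csv", "kv", "a,1\nb,2")
```

Adding a new storage then means writing one reader and one writer plug-in, after which it can exchange data with every storage already registered.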
2. Synchronization architecture based on DataX
The first version of heterogeneous data source synchronization was an architecture based on DataX, comprising a tool platform layer, a scheduling layer and an execution layer.
- The tool platform layer's tasks are very simple: it mainly faces users, configuring synchronization tasks, scheduling, and operations and maintenance.
- The scheduling layer is responsible for task scheduling; of course, a lot of work goes into task state management and executor machine management.
- In the execution layer, data is synchronized from source to destination through DataX processes in a multi-threaded task model.
In this architecture, we found two core problems. The first is scalability: the open-source, stand-alone version of DataX uses a single-machine multi-threaded model, and when we need to transmit large volumes of data, its scalability becomes a big problem. The second is in the scheduling layer: we have to manage machines, synchronization state and synchronization tasks ourselves, which is very cumbersome, and when a scheduling executor fails, we have to handle the disaster recovery entirely on our own.
3. Flink-based synchronization architecture
Based on this, we rebuilt it as a Flink-based synchronization architecture. The front part, the tool platform layer, remains unchanged. Compared with the original architecture, we delegated the scheduling layer's task scheduling and executor machine management to YARN, freeing ourselves from that work; and the scheduling layer's task state management can be migrated directly into the cluster.
The architectural advantages of Flink-based DataLink are obvious.
- First, the scalability problem is solved and the architecture is very simple. A synchronization task is now split up and spread across the TaskManagers of a distributed cluster.
- Second, offline and real-time synchronization tasks are unified under the Flink framework, and all of our synchronization source and sink plug-ins can be shared, which is a great advantage.
4. Key designs of the Flink-based synchronization architecture
Let's look at the key designs of the Flink-based synchronization architecture. There are four lessons learned here.
- First, avoid shuffles across TaskManagers, avoiding unnecessary serialization overhead;
- Second, design a dirty-data bypass and a failure feedback mechanism;
- Third, use Flink's accumulators to design a graceful exit mechanism for batch tasks;
- Fourth, use S3 for unified management of reader/writer plug-ins with distributed hot loading, improving deployment efficiency.
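Two of these designs, the dirty-data bypass and the counter-driven graceful exit, can be sketched in simplified form (plain Python standing in for Flink's side outputs and accumulators; function names are illustrative):

```python
def run_batch_sync(raw_records, parse, write):
    """Synchronize a bounded batch: unparsable records are routed to a
    dirty-data side channel instead of failing the job, and counters
    (standing in for Flink accumulators) verify every record is
    accounted for before the job exits."""
    counters = {"read": 0, "written": 0, "dirty": 0}
    dirty_records = []
    for raw in raw_records:
        counters["read"] += 1
        try:
            row = parse(raw)
        except ValueError:
            dirty_records.append(raw)      # bypass instead of failing
            counters["dirty"] += 1
            continue
        write(row)
        counters["written"] += 1
    # graceful-exit check: the input is fully drained and accounted for
    if counters["read"] != counters["written"] + counters["dirty"]:
        raise RuntimeError("record counts diverged; refusing to exit")
    return counters, dirty_records

sink = []
counters, dirty = run_batch_sync(["1", "2", "oops", "4"], int, sink.append)
```

In the real architecture the counters would be Flink accumulators aggregated on the JobManager, and the dirty records would feed the failure-feedback mechanism rather than a local list.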
5. Flink-based OLAP production platform
Based on Flink, we built the DataLink data export platform, and on top of DataLink an OLAP production platform. Besides the underlying engine layer, we built a platform layer that manages resources, models, tasks and permissions, which makes our OLAP production very fast.
Below are two screenshots of our OLAP production: one shows model management in OLAP, and the other shows task configuration management.
5、 Future plans
After these iterations, we have applied Flink to data integration, data processing, offline data export and OLAP production. In the future, we hope to unify stream and batch processing, and to unify the data itself across stream and batch. Once the data is unified, we hope both the real-time link and the incremental processing link will be handled by Flink, achieving true stream-batch unification.
Author: Alibaba Cloud Realtime Compute for Apache Flink
This article is original content from Alibaba Cloud and may not be reproduced without permission.