37 Mobile Games' Lake-Warehouse Integration Solution Based on Flink CDC + Hudi

Time: 2021-12-27

Introduction: This article explains why 37 Mobile Games chose Flink as its compute engine and how it built a new lake-warehouse integration solution based on Flink CDC + Hudi.

The author, Xu Runbai, is a big data developer at 37 Mobile Games. He explains why 37 Mobile Games chose Flink as its compute engine and how the new lake-warehouse integration solution based on Flink CDC + Hudi was built. The main contents include:

Introduction to Flink CDC basics

Introduction to Hudi basics

Business pain points and technology selection at 37 Mobile Games

Introduction to the 37 Mobile Games lake-warehouse integration

Flink CDC + Hudi practice

Summary

1、 Flink CDC 2.0

Flink CDC Connectors is a set of source connectors for Apache Flink. Version 2.0 currently supports reading from MySQL and PostgreSQL data sources, and the community has confirmed that version 2.1 will support Oracle and MongoDB data sources.

The core features of Flink CDC 2.0 come down to three very important capabilities:

Lock-free: the whole process requires no locks, so there is no risk of locking the source database;

Parallel reading: the full-snapshot phase supports horizontal scaling, so tables with hundreds of millions of rows can be read faster by increasing the parallelism;

Resumable: checkpoints are supported during the full-snapshot phase, so even if a job exits for some reason, it can be recovered from a saved checkpoint and resume reading where it left off.

(Figure: core improvements in Flink CDC 2.0)
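
As a minimal sketch of the checkpoint settings such a resumable snapshot phase relies on (the interval, directory, and retention values below are illustrative assumptions, not values from this article), the relevant flink-conf.yaml entries look roughly like this:

# Trigger periodic checkpoints so a CDC job can be recovered from the latest one
execution.checkpointing.interval: 10min
# Store checkpoints on HDFS (path is hypothetical)
state.checkpoints.dir: hdfs:///flink/checkpoints
# Keep checkpoints when a job is cancelled so it can be restarted from them later
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION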

2、 Hudi

Apache Hudi is currently positioned by the industry as a streaming data lake platform built around a database kernel.

Hudi has good upsert capability, and its 0.10 master branch supports Flink 1.13.x. We therefore chose Flink + Hudi to provide minute-level upsert and analytical query capability for 37 Mobile Games' business scenarios.

3、 Business pain points and technology selection at 37 Mobile Games

1. Old architecture and business pain points

1.1 Insufficient data freshness

  • Log data is synchronized to Hive via Sqoop every 30 minutes, each run covering the previous 60 minutes of data;
  • Database data is synchronized to Hive via Sqoop every 60 minutes;
  • Database data is also synchronized to Hive via Sqoop once a day, each run covering the previous 60 days of data.

1.2 Business code logic is complex and hard to maintain

Much of the business development at 37 Mobile Games still follows a MySQL + PHP development model, and the code logic is complex and hard to maintain;
For the same business logic, one set of code has to be developed for stream processing and another for batch processing, and the two cannot be reused.

1.3 Frequent backfills of historical data

Historical data is frequently re-processed to ensure data consistency.

1.4 Frequent schema changes

Because of business requirements, table fields often need to be added.

1.5 Low Hive version

  • Hive is currently on version 1.x, and upgrading it is difficult;
  • Upsert is not supported;
  • Row-level delete is not supported.

In 37 Mobile Games' business scenarios, upsert and delete are common requirements, so the Hive-based data warehouse architecture no longer meets the business needs.

2. Technology selection

For the synchronization tool, we considered Canal and Maxwell. Canal, however, only handles incremental data and has to be deployed and maintained separately, which makes it relatively heavy. Maxwell is lighter, but like Canal it needs to be paired with a message queue such as Kafka. In contrast, Flink CDC can be used directly from Flink SQL by configuring a connector; it is very lightweight and fits perfectly with our stream-batch unified architecture based on Flink SQL.

For the storage engine, the most popular data lake products at the moment are Apache Hudi, Apache Iceberg, and Delta Lake, each with its own pros and cons in our scenario. In the end, we chose Hudi as the storage engine for lake-warehouse integration and stream-batch unification, based on its openness to the upstream and downstream ecosystem, its support for a global index, its support for Flink 1.13, and its compatibility with our Hive version (Iceberg does not support Hive 1.x).

Given the business pain points and selection comparison above, our final plan is: use Flink 1.13.2 as the compute engine, rely on Flink's unified stream-batch API to implement stream-batch unification with Flink SQL, use Flink CDC 2.0 as the data synchronization tool for the ODS layer, and use hudi-0.10 master as the storage engine, thereby solving the pain point of maintaining two sets of code.

4、 The new lake-warehouse integrated architecture

The lake-warehouse integration solution at 37 Mobile Games is part of its overall stream-batch unified architecture. Through lake-warehouse integration and stream-batch unification, we achieve, in near-real-time scenarios, the same data source, the same compute engine, the same storage, and the same computation caliber. Data freshness reaches the minute level, which serves the business's need for a near-real-time data warehouse well. The architecture diagram is as follows:

(Architecture diagram: lake-warehouse integration at 37 Mobile Games based on Flink CDC + Hudi)

MySQL data enters Kafka through Flink CDC. The data goes into Kafka first rather than directly into Hudi so that it can be reused by multiple real-time jobs; this avoids multiple jobs each connecting to the MySQL tables and binlog through Flink CDC and hurting the performance of the MySQL database.

Besides landing in the ODS layer of the offline data warehouse, the data ingested into Kafka via CDC also flows along the real-time warehouse pipeline ODS -> DWD -> DWS -> OLAP database and is finally used for data services such as reporting. The result data of each layer of the real-time data warehouse is written to the offline data warehouse in near real time as well. In this way, a program is developed once, the metric caliber is unified, and the data is unified.

The architecture diagram also shows a data correction step (rerunning historical data). This step exists because historical data may need to be rerun when the caliber is adjusted or when the previous day's real-time results turn out to be wrong.

Data stored in Kafka has an expiration time, and historical data is not retained for long, so reruns that reach far back cannot obtain their source data from Kafka. Moreover, if a large amount of historical data were pushed into Kafka again and corrected through the real-time pipeline, the day's real-time jobs could be affected. Therefore, reruns of historical data are handled through the data correction step.

Overall, the 37 Mobile Games data warehouse is a hybrid of the Lambda and Kappa architectures. Every pipeline of the stream-batch unified warehouse has a data quality verification process. The next day, the previous day's data is reconciled; if the previous day's real-time results are correct, no data correction is needed, and the Kappa architecture alone is sufficient.

5、 Flink CDC 2.0 + Kafka + Hudi 0.10 practice

1. Environment preparation

Flink 1.13.2
../lib/hudi-flink-bundle_2.11-0.10.0-SNAPSHOT.jar (built from the master branch with its Flink version changed to 1.13.2)
../lib/hadoop-mapreduce-client-core-2.7.3.jar (resolves a Hudi ClassNotFoundException)
../lib/flink-sql-connector-mysql-cdc-2.0.0.jar
../lib/flink-format-changelog-json-2.0.0.jar
../lib/flink-sql-connector-kafka_2.11-1.13.2.jar

Source mysql-cdc table definition:

create table sy_payment_cdc (
  ID BIGINT,
  ...
  PRIMARY KEY(ID) NOT ENFORCED
) with(
  'connector' = 'mysql-cdc',
  'hostname' = '',
  'port' = '',
  'username' = '',
  'password' = '',
  'database-name' = '',
  'table-name' = '',
  'connect.timeout' = '60s',
  'scan.incremental.snapshot.chunk.size' = '100000',
  'server-id'='5401-5416'
);

Note that the scan.incremental.snapshot.chunk.size parameter should be configured according to the actual situation; if the table is small, the default value can be used.

Definitions of the Kafka sink table and the Hudi COW table:

create table sy_payment_cdc2kafka (
  ID BIGINT,
  ...
  PRIMARY KEY(ID) NOT ENFORCED
) with (
  'connector' = 'kafka',
  'topic' = '',
  'scan.startup.mode' = 'latest-offset',
  'properties.bootstrap.servers' = '',
  'properties.group.id' = '',
  'key.format' = '',
  'key.fields' = '',
  'format' = 'changelog-json'
);

create table sy_payment2Hudi (
  ID BIGINT,
  ...
  PRIMARY KEY(ID) NOT ENFORCED
)
PARTITIONED BY (YMD)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///data/Hudi/m37_mpay_tj/sy_payment',
  'table.type' = 'COPY_ON_WRITE',
  'partition.default_name' = 'YMD',
  'write.insert.drop.duplicates' = 'true',
  'write.bulk_insert.shuffle_by_partition' = 'false',
  'write.bulk_insert.sort_by_partition' = 'false',
  'write.precombine.field' = 'MTIME',
  'write.tasks' = '16',
  'write.bucket_assign.tasks' = '16',
  'write.task.max.size' = '',
  'write.merge.max_memory' = ''
);
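
For completeness, a minimal sketch of the two INSERT jobs that wire these tables together. The column lists are elided just as in the DDLs above, and deriving the YMD partition value from MTIME is an assumption for illustration only:

-- Job 1: read the MySQL full + incremental changelog through the CDC source and write it into Kafka
insert into sy_payment_cdc2kafka
select ID, ... from sy_payment_cdc;

-- Job 2: consume the changelog from Kafka and upsert it into the Hudi COW table
insert into sy_payment2Hudi
select ID, ..., DATE_FORMAT(MTIME, 'yyyyMMdd') as YMD  -- partition value; actual derivation is an assumption
from sy_payment_cdc2kafka;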

For loading historical data into Hudi, one option is to use offline bulk_insert to write the data into the lake, load the index through index bootstrap, and then take over the incremental data. With bulk_insert, the uniqueness of the ingested data depends entirely on the source data itself, and you must also ensure that no data is lost when switching over to the incremental stream.
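
As a rough sketch of that route (the table name is hypothetical, and the options come from the Hudi Flink configuration; verify them against the Hudi version in use):

-- hypothetical variant of the sy_payment2Hudi definition used only for the one-off offline load
create table sy_payment2Hudi_bulk (
  ID BIGINT,
  ...
  PRIMARY KEY(ID) NOT ENFORCED
)
PARTITIONED BY (YMD)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///data/Hudi/m37_mpay_tj/sy_payment',
  'table.type' = 'COPY_ON_WRITE',
  -- one-off offline load; uniqueness relies on the source data itself
  'write.operation' = 'bulk_insert'
);

-- The follow-up incremental upsert job would additionally set 'index.bootstrap.enabled' = 'true'
-- so that the existing records are loaded into the Flink index state before upserts resume.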

Here we chose a simpler approach: adjusting task resources and loading the historical data through the normal pipeline. Relying on Flink's checkpoint mechanism, both the CDC-to-Kafka job and the Kafka-to-Hudi job can be restarted from a specified checkpoint, so no data is lost.

When configuring the CDC-to-Kafka and Kafka-to-Hudi jobs, we can increase the memory and configure a higher parallelism to speed up loading the historical data into the lake. After all historical data has been loaded, we can reduce the memory configuration of the ingestion jobs accordingly and set the parallelism of the CDC-to-Kafka job to 1 (CDC reads with a single parallelism in the incremental phase), and then restart the jobs from a specified checkpoint.
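
A minimal sketch of such a restart, setting execution.savepoint.path to a retained checkpoint before re-submitting the statement (the checkpoint path is hypothetical, and the exact SET quoting syntax varies slightly across Flink client versions):

-- resume the CDC-to-Kafka job from a retained checkpoint, with parallelism reduced to 1
SET 'execution.savepoint.path' = 'hdfs:///flink/checkpoints/<job-id>/chk-100';
SET 'parallelism.default' = '1';
insert into sy_payment_cdc2kafka select ID, ... from sy_payment_cdc;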

With the parameter configuration defined in the table above, i.e. a parallelism of 16 and 50 GB of Flink TaskManager memory, loading 1.5 billion rows of historical data from a single table into a Hudi COW table actually took 10 hours, and loading 900 million rows from another single table took 6 hours. A large part of this time comes from COW write amplification: with a large volume of data, upsert mode simply takes longer.

Our cluster currently consists of more than 200 machines, runs more than 200 online streaming jobs, and holds a total data volume close to 2 PB.

If cluster resources are very limited, you can adjust the memory configuration of the Hudi table and the Flink job according to the actual situation, and you can also use the rate-limiting parameter write.rate.limit to let historical data flow into the lake slowly.
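
For example, adding the following option to the Hudi table's WITH clause (the value is illustrative only; the limit is in records per second):

  'write.rate.limit' = '30000'  -- throttle the backfill so it does not overwhelm the cluster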

Previously, with Flink CDC 1.x, the full-snapshot phase was read with a single parallelism, so tables with more than 100 million rows stayed in the snapshot phase for a long time, checkpoints would fail, and resumable transmission could not be guaranteed.

So to get data into Hudi back then, we would first start a CDC 1.x job that wrote the incremental data from that moment on into Kafka, then start a Sqoop job to pull all of the current data into Hive, read the Hive data with Flink and write it to Hudi, and finally replay Kafka's incremental data from the beginning into Hudi. Because the Kafka and Hive data overlap, no data is lost, and Hudi's upsert capability guarantees uniqueness.

However, this pipeline is too long and hard to operate. Now that CDC 2.0 supports parallel reading and checkpoints in the full-snapshot phase, the complexity of the architecture is greatly reduced.

2. Data comparison

  • Because the production environment uses Hive 1.x, and Hudi does not support Hive sync for 1.x, we query the data by creating a Hive external table. For Hive 2.x and above, refer to the Hudi Hive sync documentation;
  • Create the Hive external table and pre-create its partitions (a partition pre-creation example follows the DDL below);
  • Place hudi-hadoop-mr-bundle-0.10.0-SNAPSHOT.jar in the Hive auxlib directory.

    CREATE EXTERNAL TABLE m37_mpay_tj.`ods_sy_payment_f_d_b_ext`(
    `_hoodie_commit_time` string,
    `_hoodie_commit_seqno` string,
    `_hoodie_record_key` string,
    `_hoodie_partition_path` string,
    `_hoodie_file_name` string,
    `ID` bigint,
    ...
    )
    PARTITIONED BY (
    `dt` string)
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
    'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
    'hdfs:///data/Hudi/m37_mpay_tj/sy_payment'
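
Since partitions are not registered automatically in this setup, they are pre-created; a sketch (the partition value and sub-path are illustrative):

    ALTER TABLE m37_mpay_tj.ods_sy_payment_f_d_b_ext ADD IF NOT EXISTS
    PARTITION (dt='20211227')
    LOCATION 'hdfs:///data/Hudi/m37_mpay_tj/sy_payment/20211227';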

Finally, we query the Hudi data through the Hive external table and compare it with the Hive data originally synchronized by Sqoop:

  • The total row counts match;
  • The daily row counts match;
  • The daily aggregated amounts match.
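
These checks were done with queries along the lines of the following sketch, run against both the Hudi external table and the Sqoop-synced Hive table (the amount column name is hypothetical):

select dt, count(*) as cnt, sum(AMOUNT) as total_amount  -- AMOUNT is an assumed column name
from m37_mpay_tj.ods_sy_payment_f_d_b_ext
group by dt;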

6、 Summary

Compared with the traditional data warehouse architecture, the lake-warehouse integrated and stream-batch unified architecture has the following advantages:

  • Hudi's upsert capability solves the pain point of frequent upserts / deletes;
  • Minute-level data freshness, much better than a traditional data warehouse;
  • Stream-batch unification based on Flink SQL, with low code maintenance cost;
  • The same data source, the same compute engine, the same storage, and the same computation caliber;
  • Using Flink CDC as the data synchronization tool saves the maintenance cost of Sqoop.

Finally, for the pain point of frequently added table fields, we hope that new fields can be added automatically when data is synchronized to downstream systems. There is no perfect solution at present, and we hope the Flink CDC community can provide schema evolution support in future versions.
