Practice of Apache Hudi in Building a Real-Time Data Lake at Bilibili (Station B)

Date: 2021-10-21

Introduction: why Bilibili chose the Flink + Hudi data lake solution, and how it was optimized.
In this article, the author Yu Zhaojing explains why Bilibili chose the Flink + Hudi data lake solution and how it was optimized. The main contents are as follows:

Traditional offline data warehouse pain points
Technical scheme of the data lake
Hudi task stability assurance
Data ingestion practice
Benefits of the incremental data lake platform
Community contribution
Future development and thinking

1、 Traditional offline data warehouse pain points

1. Pain points

The ingestion process of Bilibili's previous data warehouse was roughly as follows:

[Figure: previous data warehouse ingestion process]

This architecture led to the following core pain points:

After large volumes of data land on HDFS, downstream queries and processing can only start after the partition is archived in the early morning;
Synchronization of large RDS tables can likewise only be processed after the early-morning partition archiving; generating the current day's data requires sorting, deduplicating and joining against the previous day's partition;
Data can only be read at partition granularity, which causes a large amount of redundant I/O in scenarios such as stream splitting.
To sum up:

  • Scheduling starts late;
  • Merging is slow;
  • Data is read repeatedly.

2. Thinking about the pain points

  • Late scheduling start

Idea: since Flink writes to ODS in quasi real time and there is a clear notion of file-level increments, file-based incremental synchronization can be used so that cleaning, dimension enrichment, stream splitting and other logic are processed incrementally, before the ODS partition is archived. In theory, the data delay then depends only on the processing time of the last batch of files.

  • Slow merging

Idea: since reading can be incremental, merging can be incremental too. Combining the data lake's capabilities with incremental reads makes incremental merging possible.

  • Repeated reads

Idea: the main reason for repeated reads is that partition granularity is too coarse, accurate only to the hour or day level. We need a more fine-grained data organization scheme; with data skipping at the field level, queries can be served efficiently.
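
To make field-level data skipping concrete, here is a minimal sketch in Java of how per-file min/max statistics on a field let a query prune files before reading them. It illustrates the principle only; the `FileStats` record and file names are hypothetical, not Hudi's actual metadata structures.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical per-file statistics for one field, e.g. collected at write time.
record FileStats(String filePath, long minValue, long maxValue) {}

public class DataSkippingSketch {
    // Keep only the files whose [min, max] range can contain the predicate value.
    static List<String> pruneFiles(List<FileStats> stats, long predicateValue) {
        return stats.stream()
                .filter(s -> s.minValue() <= predicateValue && predicateValue <= s.maxValue())
                .map(FileStats::filePath)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FileStats> stats = List.of(
                new FileStats("part-0001.parquet", 0, 999),
                new FileStats("part-0002.parquet", 1000, 1999),
                new FileStats("part-0003.parquet", 2000, 2999));
        // A query like "WHERE user_id = 1234" only needs to read part-0002.parquet.
        System.out.println(pruneFiles(stats, 1234L));
    }
}
```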

3. Solution: Magneto, a Hudi-based incremental data lake platform

The following is the ingestion process based on Magneto:

[Figure: Magneto-based ingestion process]

  • Flow

Use a streaming Flow to unify the offline and real-time ETL pipelines

  • Organizer

Reorganize data to accelerate queries
Support compaction of incremental data

  • Engine

Flink is used for the computing layer and Hudi is used for the storage layer

  • Metadata

Consolidate the SQL logic of table computation
Standardize the Table Format computing paradigm

2、 Technical scheme of the data lake

1. Choosing between Iceberg and Hudi

1.1 comparison of technical details

[Table: technical comparison of Hudi and Iceberg]

1.2 community activity comparison

Statistics as of August 9, 2021

[Figure: community activity comparison of Hudi and Iceberg]

1.3 summary

The comparison can be roughly broken down along the following main dimensions:

  • Support for append

Append is the scenario Iceberg was designed around, and it has been optimized for it from the start. Hudi added append mode in version 0.9, and in most scenarios there is now little difference between the two; optimization continues in version 0.10, bringing Hudi's performance very close to Iceberg's.

  • Support for upsert

Upsert is the scenario Hudi was designed around, and compared with Iceberg's design it has clear advantages in performance and file count, with the comparison process and logic exposed through well-abstracted interfaces. Iceberg's upsert support started later, and the community solution still lags behind Hudi in performance, small-file handling, and so on.

  • Community activity

Hudi's community is significantly more active than Iceberg's. Thanks to this active community, Hudi's feature richness has pulled ahead of Iceberg to some extent.

After a comprehensive comparison, we chose Hudi as our data lake component and continue to optimize the features we need (better Flink integration, clustering support, etc.).

2. Choosing Flink + Hudi as the write path

There are three main reasons why we chose Flink + Hudi as the way to write into Hudi:

We partially maintain the Flink engine, which supports real-time computing for the whole company. Considering the cost, we do not want to maintain two compute engines at the same time, especially since our internal Spark version also carries many internal modifications.

The Spark + Hudi integration mainly offers two index schemes to choose from, but both have drawbacks:

Bloom index: when using the Bloom index, each Spark task lists all the files and reads the Bloom filter data written in the file footers during the write, which puts terrible pressure on an HDFS cluster that is already under heavy internal load.

HBase index: this gives O(1) index lookups, but it introduces an external dependency that makes the whole scheme heavier.

We needed to integrate with our Flink-based incremental processing framework.
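
For reference, the index choice discussed above is controlled by Hudi write configuration. Below is a hedged Java sketch of how a Spark writer might select the Bloom or HBase index; the option keys are standard Hudi configs, while the paths, table name, fields and ZooKeeper quorum are hypothetical placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiIndexOptionsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("hudi-index-demo").getOrCreate();
        Dataset<Row> df = spark.read().parquet("hdfs:///tmp/source_data"); // hypothetical source path

        df.write().format("hudi")
          .option("hoodie.table.name", "demo_table")
          .option("hoodie.datasource.write.recordkey.field", "id")
          .option("hoodie.datasource.write.precombine.field", "ts")
          // Bloom index: no external dependency, but tasks list files and read footer
          // bloom filters on every write, which pressures HDFS (the pain point above).
          .option("hoodie.index.type", "BLOOM")
          // HBase index alternative: O(1) lookups, at the cost of an external dependency.
          // .option("hoodie.index.type", "HBASE")
          // .option("hoodie.index.hbase.zkquorum", "zk1,zk2,zk3")   // hypothetical quorum
          // .option("hoodie.index.hbase.zkport", "2181")
          // .option("hoodie.index.hbase.table", "hudi_index")
          .mode(SaveMode.Append)
          .save("hdfs:///warehouse/demo_table"); // hypothetical table path
    }
}
```

The Flink integration avoids this trade-off by keeping the index in Flink state, which is one of the reasons listed above for choosing Flink + Hudi.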

3. Optimization of the Flink + Hudi integration

3.1 Hudi 0.8's Flink integration scheme

[Figure: Hudi 0.8 Flink integration scheme]

For the problems exposed by the Hudi 0.8 integration, Bilibili worked with the community to optimize and improve it.

3.2 bootstrap state cold start

Background: support starting a Flink job that writes into an existing Hudi table, so that the pipeline can be switched from Spark on Hudi to Flink on Hudi.

Original scheme:

[Figure: original bootstrap scheme]

Problem: every task processes the full amount of data and then keeps only the HoodieKeys belonging to the current task in its state.

Optimized scheme:

[Figure: optimized bootstrap scheme]

  • During initialization, each Bootstrap operator loads only the base files and log files of the fileIds belonging to the current task;
  • Assemble the record keys in the base files and log files into HoodieKeys, send them keyed by HoodieKey to the BucketAssignFunction, and store the HoodieKey as an index in the state of the BucketAssignFunction.

Effect: by pulling the bootstrap logic into a separate operator, index loading becomes horizontally scalable, and loading speed is improved by N times (depending on the parallelism).
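
The following is a heavily simplified sketch of that idea, not Hudi's actual BootstrapOperator or BucketAssignFunction code: a dedicated bootstrap stage emits the existing key-to-fileId mapping, which is keyed by record key and loaded into the assigner's keyed state. All class and field names are illustrative.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Illustrative record: the key of an existing record and the file group it lives in,
// as emitted by a bootstrap stage that scans only its own base/log files.
class IndexEntry {
    String recordKey;     // record key + partition path form the HoodieKey
    String partitionPath;
    String fileId;
}

// Keyed by recordKey: stores the fileId as index state, so that later upserts
// for the same key are routed to the same file group.
class BucketAssignSketch extends KeyedProcessFunction<String, IndexEntry, IndexEntry> {
    private transient ValueState<String> fileIdState;

    @Override
    public void open(Configuration parameters) {
        fileIdState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("file-id-index", String.class));
    }

    @Override
    public void processElement(IndexEntry entry, Context ctx, Collector<IndexEntry> out) throws Exception {
        if (fileIdState.value() == null) {
            // First time this key is seen after bootstrap: remember its file group.
            fileIdState.update(entry.fileId);
        }
        out.collect(entry);
    }
}
// Wiring (illustrative): bootstrapStream.keyBy(e -> e.recordKey).process(new BucketAssignSketch())
```

Because the bootstrap stage is keyed before the assigner, each assigner subtask only ever receives and stores the keys it is responsible for, which is what makes the loading scale with parallelism.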

3.3 checkpoint consistency optimization

Background: in the StreamWriteFunction of Hudi 0.8, there is a data consistency problem in extreme cases.

Original scheme:

[Figure: original checkpoint handling]

Problem: the checkpoint-complete notification is not part of the checkpoint lifecycle, so a checkpoint can succeed while the instant is never committed, resulting in data loss.

Optimization scheme:

[Figure: optimized checkpoint handling]
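
A hedged sketch of the shape of such a fix, not Hudi's actual operator coordinator: data is flushed as part of the checkpoint, the instant is committed only in notifyCheckpointComplete, and on restore any instant that was flushed but never committed is reconciled, so a successful checkpoint can no longer leave an uncommitted instant behind. All helper methods and the class name below are illustrative.

```java
import org.apache.flink.api.common.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Illustrative write function: flush inside the checkpoint, commit after completion.
class InstantCommittingSinkSketch extends RichSinkFunction<String>
        implements CheckpointedFunction, CheckpointListener {

    private String pendingInstant; // instant whose data is flushed but not yet committed

    @Override
    public void invoke(String record, Context context) {
        bufferRecord(record);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // Flush buffered data for the in-flight instant as part of the checkpoint.
        pendingInstant = flushBufferToNewInstant();
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // Only commit once the checkpoint is known to be complete.
        if (pendingInstant != null) {
            commitInstant(pendingInstant);
            pendingInstant = null;
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // On recovery, reconcile: an instant flushed before the failure but never
        // committed must be committed or rolled back here, so no data is lost.
    }

    // Hypothetical helpers standing in for Hudi client calls.
    private void bufferRecord(String record) {}
    private String flushBufferToNewInstant() { return "20211021120000"; }
    private void commitInstant(String instant) {}
}
```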

3.4 append mode support and optimization

Background: append mode is used for data sets that do not need updates. Unnecessary steps such as indexing and merging can be skipped, greatly improving write efficiency.

[Figure: append mode write path]

Main modifications:

  • Support writing a new file for each flush of the bucket, avoiding read and write amplification;
  • Add a parameter to disable the rate-limiting mechanism inside BoundedInMemoryQueue; in Flink append mode, the queue size only needs to match the bucket buffer size;
  • Develop a custom compaction plan for the small files generated by each checkpoint;
  • With the above development and optimization, performance in the pure-insert scenario reaches 5x that of the original COW path.
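
As a hedged illustration of how the append path is typically enabled on the Flink + Hudi connector, the write operation is set to insert so that index lookup and merging are skipped. The option keys follow Hudi's Flink connector documentation; the table name, columns and path are hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiAppendModeSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Append-only Hudi sink: 'write.operation' = 'insert' skips index lookup and merge.
        tEnv.executeSql(
                "CREATE TABLE ods_log_hudi (\n"
              + "  log_id STRING,\n"
              + "  event_time TIMESTAMP(3),\n"
              + "  payload STRING\n"
              + ") WITH (\n"
              + "  'connector' = 'hudi',\n"
              + "  'path' = 'hdfs:///warehouse/ods_log_hudi',\n"   // hypothetical path
              + "  'table.type' = 'COPY_ON_WRITE',\n"
              + "  'write.operation' = 'insert'\n"
              + ")");
    }
}
```

A streaming INSERT INTO this table from the ODS source then writes new files as buckets flush, in line with the first modification listed above.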

3、 Hudi task stability assurance

1. Flink metrics in the Hudi integration

By reporting metrics at key points, the operation of the whole job can be clearly tracked:

[Figures: metrics reported at key points of the write pipeline]
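
For reference, a minimal sketch of how such metrics can be reported from a Flink operator using the standard metric group API; the metric names and the operator itself are illustrative, not the exact ones used by the Hudi integration.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Gauge;

// Reports a write counter and a commit-latency gauge from inside an operator.
class MetricReportingMap extends RichMapFunction<String, String> {
    private transient Counter recordsWritten;
    private volatile long lastCommitDurationMs;

    @Override
    public void open(Configuration parameters) {
        recordsWritten = getRuntimeContext().getMetricGroup().counter("hudiRecordsWritten");
        getRuntimeContext().getMetricGroup()
                .gauge("hudiLastCommitDurationMs", (Gauge<Long>) () -> lastCommitDurationMs);
    }

    @Override
    public String map(String record) {
        recordsWritten.inc();
        return record;
    }
}
```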

2. Data verification in the system

[Figure: in-system data verification]

3. Data verification outside the system

[Figure: out-of-system data verification]

4、 Data ingestion practice

1. CDC data ingestion into the lake

1.1 TiDB ingestion scheme

At present, no open-source solution can directly export data from TiDB, and using SELECT directly would affect database stability, so ingestion is split into a full load plus an incremental load:

  • Start TiCDC and write TiDB's CDC data into the corresponding Kafka topic;
  • Use the Dumpling component provided by TiDB, with part of its source code modified to support writing directly to HDFS;
  • Start Flink and write the full data into Hudi through bulk insert;
  • Consume the incremental CDC data and write it into Hudi through Flink in MOR mode.
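
The full + incremental split can be sketched with the Flink SQL API roughly as follows. The connector option keys follow the Flink, Hudi and Kafka connector docs, but the topic name, paths, schema and the canal-json format for the TiCDC changelog are assumptions (TiCDC's output format depends on how the changefeed is configured).

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TidbToHudiSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Target Hudi table, MOR so incremental CDC writes stay cheap.
        tEnv.executeSql(
                "CREATE TABLE orders_hudi (\n"
              + "  id BIGINT,\n"
              + "  amount DECIMAL(10, 2),\n"
              + "  ts TIMESTAMP(3),\n"
              + "  PRIMARY KEY (id) NOT ENFORCED\n"
              + ") WITH (\n"
              + "  'connector' = 'hudi',\n"
              + "  'path' = 'hdfs:///warehouse/orders_hudi',\n"
              + "  'table.type' = 'MERGE_ON_READ'\n"
              + ")");

        // Full load: the dump exported to HDFS (via the modified Dumpling) is loaded
        // once with bulk insert -- shown here as a generic filesystem source.
        tEnv.executeSql(
                "CREATE TABLE orders_full (\n"
              + "  id BIGINT, amount DECIMAL(10, 2), ts TIMESTAMP(3)\n"
              + ") WITH (\n"
              + "  'connector' = 'filesystem',\n"
              + "  'path' = 'hdfs:///dump/orders',\n"
              + "  'format' = 'csv'\n"
              + ")");

        // Incremental load: TiCDC changelog read from Kafka, assumed to be canal-json.
        tEnv.executeSql(
                "CREATE TABLE orders_cdc (\n"
              + "  id BIGINT, amount DECIMAL(10, 2), ts TIMESTAMP(3)\n"
              + ") WITH (\n"
              + "  'connector' = 'kafka',\n"
              + "  'topic' = 'ticdc-orders',\n"
              + "  'properties.bootstrap.servers' = 'kafka:9092',\n"
              + "  'scan.startup.mode' = 'earliest-offset',\n"
              + "  'format' = 'canal-json'\n"
              + ")");

        // The backfill would run first as a separate batch job, e.g.:
        // INSERT INTO orders_hudi /*+ OPTIONS('write.operation'='bulk_insert') */ SELECT * FROM orders_full
        // Then the streaming job keeps the table up to date:
        tEnv.executeSql("INSERT INTO orders_hudi SELECT * FROM orders_cdc");
    }
}
```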

1.2 MySQL ingestion scheme

MySQL's ingestion scheme directly uses the open-source Flink CDC, writing both the full and the incremental data into a Kafka topic through one Flink task:

  • Start the Flink CDC task and import both the full data and the CDC data into the Kafka topic;
  • Start a Flink batch job to read the full data and write it into Hudi through bulk insert;
  • Switch to a Flink streaming job to write the incremental CDC data into Hudi through MOR.

[Figure: MySQL ingestion pipeline]
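
A hedged Flink SQL sketch of the MySQL path, with connector options from flink-cdc-connectors and the Hudi Flink connector. Host, credentials, schema and paths are placeholders, and for brevity the sketch collapses the intermediate Kafka topic described above, reading the CDC source straight into the Hudi MOR table.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MysqlCdcToHudiSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Flink CDC source: snapshots the full table, then reads the binlog incrementally.
        tEnv.executeSql(
                "CREATE TABLE users_cdc (\n"
              + "  id BIGINT,\n"
              + "  name STRING,\n"
              + "  updated_at TIMESTAMP(3),\n"
              + "  PRIMARY KEY (id) NOT ENFORCED\n"
              + ") WITH (\n"
              + "  'connector' = 'mysql-cdc',\n"
              + "  'hostname' = 'mysql.internal',\n"
              + "  'port' = '3306',\n"
              + "  'username' = 'reader',\n"
              + "  'password' = '******',\n"
              + "  'database-name' = 'app',\n"
              + "  'table-name' = 'users'\n"
              + ")");

        // MOR Hudi sink keyed on the primary key, so changes are upserted.
        tEnv.executeSql(
                "CREATE TABLE users_hudi (\n"
              + "  id BIGINT,\n"
              + "  name STRING,\n"
              + "  updated_at TIMESTAMP(3),\n"
              + "  PRIMARY KEY (id) NOT ENFORCED\n"
              + ") WITH (\n"
              + "  'connector' = 'hudi',\n"
              + "  'path' = 'hdfs:///warehouse/users_hudi',\n"
              + "  'table.type' = 'MERGE_ON_READ'\n"
              + ")");

        tEnv.executeSql("INSERT INTO users_hudi SELECT * FROM users_cdc");
    }
}
```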

2. Incremental ingestion of log data

  • Implement HDFSStreamingSource and ReaderOperator to incrementally synchronize ODS data files, and reduce list requests against HDFS by writing ODS partition index information;
  • Support configurable transform SQL, allowing users to apply custom logic, including but not limited to dimension table joins, custom UDFs, and splitting streams by field;
  • Implement Flink on Hudi's append mode to greatly improve the write rate for data that does not need to be merged.

[Figure: incremental log ingestion pipeline]
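
HDFSStreamingSource and ReaderOperator are internal implementations. For illustration only, a similar incremental file pickup can be sketched with Flink's built-in continuously-monitoring FileSource; the directory path and poll interval are placeholders, and TextLineInputFormat is the class name in recent Flink releases (older releases name it TextLineFormat).

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalOdsFileSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Continuously discover newly landed ODS files instead of waiting for the
        // daily partition to be archived.
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("hdfs:///ods/app_log"))
                .monitorContinuously(Duration.ofMinutes(1))
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "ods-incremental-files")
           // transform SQL / UDF / dimension-join / stream-splitting logic would go here
           .print();

        env.execute("incremental-ods-sync-sketch");
    }
}
```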

5、 Benefits of the incremental data lake platform

  • Data synchronization timeliness is greatly improved through Flink incremental synchronization: partition-ready time moved up from between 2:00 and 5:00 AM to within 00:30;
  • With Hudi as the storage engine, users get multiple query methods based on COW and MOR, so different users can choose the one that fits their scenario instead of simply waiting for the partition to be archived;
  • Compared with the previous T+1 binlog merging in the data warehouse, Hudi's automatic compaction lets users query Hive as if it were a MySQL snapshot;
  • Large resource savings: the stream-splitting jobs that previously required repeated reads now only need to run once, saving roughly 18,000 CPU cores.

6、 Community contribution

The above optimizations have already been merged into the Hudi community. Bilibili will further invest in Hudi and grow together with the community over the long term.

Some of the core PRs:

https://issues.apache.org/jir…

https://issues.apache.org/jir…

https://issues.apache.org/jir…

https://issues.apache.org/jir…

https://issues.apache.org/jir…

https://issues.apache.org/jir…

https://issues.apache.org/jir…

7、 Future development and thinking

  • Make the platform support unified stream-batch processing, unifying real-time and offline logic;
  • Push incrementalization of the data warehouse, achieving the full pipeline of Hudi ODS -> Flink -> Hudi DW -> Flink -> Hudi ADS;
  • Support Hudi clustering on Flink, showcase Hudi's advantages in data organization, and explore Z-order to accelerate multi-dimensional queries;
  • Support inline clustering.

