Introduction: why Bilibili (station B) chose the Flink + Hudi data lake solution, and how it was optimized.
In this article, the author Yu Zhaojing explains why station B chose the Flink + Hudi data lake solution and how it was optimized. The main contents are as follows:
Pain points of the traditional offline data warehouse
Data lake technical solution
Hudi task stability assurance
Data ingestion practice
Benefits of the incremental data lake platform
Future development and thinking
1、 Pain points of the traditional offline data warehouse
1. Pain points
Station B's previous data warehousing process was roughly as follows:
Under this architecture, the following core pain points are generated:
After large-scale data lands in HDFS, it can only be queried and processed after the partition is archived in the early morning;
Synchronizing RDS tables with large data volumes likewise has to wait for the early-morning partition archive: the current day's data is produced by sorting, deduplicating, and joining against the previous day's partition;
Data can only be read at partition granularity, which produces a large amount of redundant IO in scenarios such as stream splitting.
To sum up:
- Scheduling starts late;
- Merging is slow;
- Data is read repeatedly.
2. Thinking about the pain points
- Late scheduling start
Idea: since Flink writes to ODS in near real time and has a clear notion of file increments, file-based incremental synchronization can run the cleaning, dimension-enrichment, stream-splitting and other logic incrementally, so data can be processed before the ODS partition is archived. In theory, data latency then depends only on the processing time of the last batch of files.
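The file-based incremental idea above can be sketched in a few lines. This is a simplified illustration (not station B's actual code): each micro-batch processes only the files that landed since the last batch, so downstream logic never has to wait for the partition archive.

```python
def plan_incremental_batch(all_files, processed):
    """Return the files that arrived since the last batch, in order."""
    new_files = sorted(f for f in all_files if f not in processed)
    processed.update(new_files)
    return new_files

processed = set()
# First batch: two files have landed in the ODS path.
assert plan_incremental_batch({"part-0", "part-1"}, processed) == ["part-0", "part-1"]
# Second batch: only the newly landed file is read.
assert plan_incremental_batch({"part-0", "part-1", "part-2"}, processed) == ["part-2"]
```

With this scheme, the latency of each record depends only on how quickly the last batch of files is picked up, matching the claim above.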
- Slow merging
Idea: since reads can be incremental, merging can be incremental too: the data lake's capabilities combined with incremental reads make incremental merging possible.
- Repeated reads
Idea: the main cause of repeated reads is that partition granularity is too coarse, accurate only to the hour/day level. We need to try finer-grained data organization schemes; with data skipping down to the field level, queries can be served efficiently.
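As a hypothetical illustration of field-level data skipping: if min/max statistics are kept per file for a column, a point query only needs to open the files whose value range can contain the key, instead of scanning an entire hour/day partition.

```python
# Per-file column statistics (illustrative data, not a real table).
files = [
    {"name": "f1", "min_uid": 0,    "max_uid": 999},
    {"name": "f2", "min_uid": 1000, "max_uid": 1999},
    {"name": "f3", "min_uid": 2000, "max_uid": 2999},
]

def prune(files, uid):
    """Keep only files whose [min, max] range can contain the queried uid."""
    return [f["name"] for f in files if f["min_uid"] <= uid <= f["max_uid"]]

assert prune(files, 1500) == ["f2"]   # only one of three files is read
```

The finer the organization of data into files (e.g. via clustering), the tighter these ranges become and the more IO the pruning saves.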
3. Solution: Magneto, a Hudi-based incremental data lake platform
The following is the warehousing process based on Magneto:
Use streaming to unify the offline and real-time ETL pipelines
Reorganize data to speed up queries
Support compaction of incremental data
Use Flink for the compute layer and Hudi for the storage layer
Refine table computation logic in SQL
Standardize the computing paradigm on the table format
2、 Data lake technical solution
1. Choosing between Iceberg and Hudi
1.1 Comparison of technical details
1.2 Comparison of community activity
Statistics as of August 9, 2021
The comparison can be roughly divided into the following main dimensions:
- Support for append
Append was Iceberg's primary scenario at design time and has been optimized for it. Hudi added append mode in version 0.9; in most scenarios there is now little difference between the two, and continued optimization in version 0.10 brings Hudi very close to Iceberg's performance.
- Support for upsert
Upsert was Hudi's primary scenario at design time; compared with Iceberg's design it has clear advantages in performance and file count, and the merge process and logic are exposed through well-abstracted interfaces. Iceberg's upsert support started later, and the community solution still lags behind Hudi in performance, small-file handling, and so on.
- Community activity
Hudi's community is significantly more active than Iceberg's. Thanks to this active community, Hudi's feature richness has opened up a certain gap over Iceberg.
After a comprehensive comparison, we chose Hudi as our data lake component and continue to optimize the capabilities we need (better Flink integration, clustering support, etc.).
2. Choosing Flink + Hudi as the write path
There are three main reasons why we chose Flink + Hudi:
We maintain our own Flink engine to support real-time computing across the whole company. For cost reasons we do not want to maintain two compute engines at the same time, especially since our internal Spark version also carries many internal modifications.
The Spark + Hudi integration offers two main index schemes, but both have drawbacks:
Bloom index: with the Bloom index, each Spark task lists all files during writes and reads the Bloom filter data written in each file's footer, which puts terrible pressure on an HDFS cluster that is already under heavy internal load.
HBase index: this provides O(1) index lookups, but it introduces an external dependency that makes the whole solution heavier.
We need to integrate with Flink's incremental processing framework.
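The two index styles above can be contrasted with a toy sketch (a deliberately simplified model, not Hudi's implementation): a Bloom filter can only answer "maybe present", so every candidate file's filter must be consulted and false positives force extra file reads, while an external hash-style index (the HBase option) maps a record key straight to its file in O(1).

```python
import hashlib

class TinyBloom:
    """Minimal Bloom filter over a 64-bit integer, for illustration only."""
    def __init__(self, bits=64):
        self.bits, self.mask = 0, bits - 1
    def _pos(self, key, i):
        h = hashlib.md5(f"{i}:{key}".encode()).digest()
        return int.from_bytes(h[:4], "big") & self.mask
    def add(self, key):
        for i in range(3):
            self.bits |= 1 << self._pos(key, i)
    def maybe_contains(self, key):
        # "True" only means the file *might* hold the key (false positives possible).
        return all((self.bits >> self._pos(key, i)) & 1 for i in range(3))

bloom = TinyBloom()
bloom.add("uid_42")
assert bloom.maybe_contains("uid_42")        # Bloom filters have no false negatives

# The HBase-style alternative: an external key -> fileId store with O(1) lookup.
hash_index = {"uid_42": "file_0007"}
assert hash_index.get("uid_42") == "file_0007"
```

The trade-off described in the text follows directly: the Bloom path is self-contained but scans file metadata at scale, while the hash path is fast but adds an external system to operate.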
3. Optimizations to the Flink + Hudi integration
3.1 The Flink integration in Hudi version 0.8
For the problems exposed by the Hudi 0.8 integration, station B worked with the community to optimize and improve it.
3.2 Bootstrap state cold start
Background: support starting a Flink write task on an existing Hudi table, enabling a switch from Spark-on-Hudi to Flink-on-Hudi.
Problem: each task processed the full dataset and then filtered out the HoodieKeys belonging to the current task to store in its state. Optimization scheme:
- During initialization, each bootstrap operator loads only the base files and log files of the fileIds belonging to the current task;
- The record keys in the base files and log files are assembled into HoodieKeys and sent via keyBy to the BucketAssignFunction, which then stores the HoodieKeys as an index in its state.
Effect: by pulling the bootstrap logic out into a separate operator, index loading becomes scalable, and loading speed increases by a factor of N (depending on parallelism).
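A simplified model of why this scales (an assumption-level sketch, not Hudi source code): once HoodieKeys are distributed via keyBy, each parallel subtask loads and holds only its own shard of the index, so total load time shrinks roughly linearly with parallelism.

```python
PARALLELISM = 4  # assumed operator parallelism for the sketch

def owning_subtask(record_key, parallelism=PARALLELISM):
    """keyBy-style routing: each record key maps to exactly one subtask."""
    return hash(record_key) % parallelism

# Per-subtask index state: record key -> fileId, as BucketAssignFunction would hold.
state = [dict() for _ in range(PARALLELISM)]
for key, file_id in [("k1", "f1"), ("k2", "f2"), ("k3", "f1")]:
    state[owning_subtask(key)][key] = file_id

# Every key lives in exactly one subtask's state; no subtask scans the full table.
assert sum(len(s) for s in state) == 3
```

This mirrors the "N times faster" effect noted above: with N subtasks, each one bootstraps roughly 1/N of the index.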
3.3 Checkpoint consistency optimization
Background: the StreamWriteFunction of Hudi version 0.8 has a data consistency problem in extreme cases.
Problem: the checkpoint-complete notification is not part of the checkpoint lifecycle, so a checkpoint can succeed while the instant is never committed, resulting in data loss.
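The shape of the fix can be sketched as follows (this is our illustrative reading of the problem, not the actual patch): the in-flight instant must travel with the checkpoint, so that on recovery a "checkpoint succeeded but instant not committed" state can be detected and recommitted instead of the data being silently dropped.

```python
class WriteCoordinator:
    """Toy model of committing a Hudi instant within the checkpoint lifecycle."""
    def __init__(self):
        self.pending = None      # instant created for the in-flight checkpoint
        self.committed = []
    def on_checkpoint(self, instant):
        self.pending = instant   # instant is recorded as part of checkpoint state
    def on_checkpoint_complete(self, instant):
        self.committed.append(instant)
        self.pending = None
    def recover(self):
        # Checkpoint succeeded but the commit notification was lost:
        # recommit the pending instant rather than losing its data.
        if self.pending is not None:
            self.on_checkpoint_complete(self.pending)

c = WriteCoordinator()
c.on_checkpoint("instant_001")
c.recover()                      # crash happened before the complete notification
assert c.committed == ["instant_001"]
```

Without the `recover` step, the data written under `instant_001` would be invisible forever, which is exactly the extreme-case loss described above.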
3.4 Append mode support and optimization
Background: append mode serves datasets that need no updates. It skips unnecessary steps such as indexing and merging, greatly improving write efficiency.
- Support writing a new file on every flush of the bucket, avoiding read and write amplification;
- Add a parameter to disable the rate-limiting mechanism inside BoundedInMemoryQueue; in Flink append mode, it is enough to set the queue size and bucket buffer size equal;
- Develop a custom clustering plan for the small files generated by each checkpoint;
- With the above development and optimization, performance reaches 5x the original COW table in the pure-insert scenario.
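The small-file planning step can be illustrated with a hypothetical sketch (the grouping policy and 128 MB target below are assumptions, not Hudi's actual planner): greedily pack the small files produced by each checkpoint into groups until a target output size is reached, so many small files are rewritten into a few large ones.

```python
TARGET = 128  # MB, an assumed target output file size

def plan_clustering(file_sizes, target=TARGET):
    """Greedily group small files until each group reaches the target size."""
    groups, cur, cur_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        cur.append(name)
        cur_size += size
        if cur_size >= target:
            groups.append(cur)
            cur, cur_size = [], 0
    if cur:
        groups.append(cur)   # leftover small files form a final group
    return groups

# Sizes in MB of small files left behind by successive checkpoints.
small_files = {"ck1-a": 30, "ck1-b": 40, "ck2-a": 70, "ck3-a": 50}
assert plan_clustering(small_files) == [["ck1-a", "ck1-b", "ck2-a"], ["ck3-a"]]
```

Fewer, larger files mean fewer HDFS blocks and list calls, which is why this matters for the append write path.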
3、 Hudi task stability assurance
1. Hudi integration with Flink metrics
By reporting metrics at key nodes, the operation of the whole task can be clearly tracked:
2. Data validation inside the system
3. Data validation outside the system
4、 Data ingestion practice
1. CDC data ingestion
1.1 TiDB ingestion scheme
At present no open source solution can directly export TiDB data, and using SELECT directly would affect database stability, so ingestion is split into a full + incremental mode:
- Start TiCDC and write TiDB's CDC data into the corresponding Kafka topic;
- Use the Dumpling export component provided by TiDB, with part of its source code modified to support writing directly to HDFS;
- Start Flink and write the full data to Hudi via bulk insert;
- Consume the incremental CDC data and write it to Hudi via Flink in MOR mode.
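The full + incremental pattern above reduces to a simple invariant, sketched here in simplified form (not the actual pipeline code): the bulk-inserted snapshot gives the base image, and CDC events from Kafka are then applied in log order as upserts and deletes keyed by primary key, which is what the MOR write path does.

```python
def apply_cdc(snapshot, events):
    """Apply CDC events, in log order, on top of a full snapshot."""
    table = dict(snapshot)                 # pk -> row, from bulk insert
    for op, pk, row in events:
        if op == "DELETE":
            table.pop(pk, None)
        else:                              # INSERT / UPDATE both upsert the row
            table[pk] = row
    return table

snapshot = {1: "alice", 2: "bob"}
events = [("UPDATE", 2, "bobby"), ("INSERT", 3, "carol"), ("DELETE", 1, None)]
assert apply_cdc(snapshot, events) == {2: "bobby", 3: "carol"}
```

As long as the CDC stream starts at or before the snapshot's position, replaying it over the snapshot converges to the current state of the source table.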
1.2 MySQL ingestion scheme
The MySQL scheme uses the open source Flink CDC directly, writing both full and incremental data into a Kafka topic through a single Flink task:
- Start the Flink CDC task to import the full data and the CDC data into the Kafka topic;
- Start a Flink batch task to read the full data and write it to Hudi via bulk insert;
- Switch to a Flink streaming task to write the incremental CDC data to Hudi via MOR.
2. Incremental ingestion of log data
- Implement HDFSStreamingSource and ReaderOperator to synchronize ODS data files incrementally, and reduce HDFS list requests by writing ODS partition index information;
- Support configurable transform SQL, allowing users to apply custom transformation logic, including but not limited to dimension table joins, custom UDFs, and splitting streams by field;
- Implement Flink-on-Hudi append mode, greatly improving write throughput for data that does not need merging.
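The partition-index idea in the first bullet can be sketched as follows (this is our assumed reading of the mechanism, not the actual implementation): the writer records each ODS partition's file list once, so the incremental source reads the index instead of issuing repeated list calls against the HDFS NameNode.

```python
class PartitionIndex:
    """Toy model: published partition indexes replace repeated HDFS list calls."""
    def __init__(self):
        self.index = {}       # partition -> recorded file list
        self.list_calls = 0   # how often we fell back to a real list request
    def publish(self, partition, files):
        self.index[partition] = list(files)
    def files_of(self, partition, hdfs_list):
        if partition in self.index:
            return self.index[partition]   # no NameNode round trip needed
        self.list_calls += 1               # fall back to a real list call
        return hdfs_list(partition)

idx = PartitionIndex()
idx.publish("dt=20210809/hour=10", ["f1", "f2"])
assert idx.files_of("dt=20210809/hour=10", lambda p: []) == ["f1", "f2"]
assert idx.list_calls == 0
```

Since every downstream splitting task would otherwise list the same partitions, amortizing the listing into one published index is where the HDFS pressure reduction comes from.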
5、 Benefits of the incremental data lake platform
- Flink incremental synchronization greatly improves data timeliness, moving the partition-ready time forward from 2:00–5:00 AM to within 00:30;
- With Hudi as the storage engine, users get multiple query modes based on COW and MOR tables, so each user can choose the query mode that fits their scenario instead of simply waiting for the partition to be archived;
- Compared with the previous T+1 binlog merge approach, Hudi's automatic compaction lets users query Hive as if it were a MySQL snapshot;
- Resources are greatly saved: the stream-splitting tasks that previously required repeated queries now run only once, saving about 18,000 CPU cores.
6、 Community contributions
The above optimizations have been merged into the Hudi community. Station B will further invest in Hudi and build a long-term, healthy interaction with the community.
Selected core PRs
7、 Future development and thinking
- Platform support for stream-batch unification, unifying real-time and offline logic;
- Make the data warehouse incremental end to end, achieving the full Hudi ODS -> Flink -> Hudi DW -> Flink -> Hudi ADS pipeline;
- Support Hudi clustering on Flink to showcase Hudi's strengths in data organization, and explore Z-order for accelerating multi-dimensional queries;
- Support inline clustering.
This article is original content from Alibaba Cloud and may not be reproduced without permission.