Production practice of building a real-time data lake with Flink + Hudi at Linkflow

Time: 2021-08-07

Processing mutable data has always been a major difficulty in big data systems, especially real-time systems. After investigating various schemes, we chose a CDC-to-Hudi data ingestion scheme, which now delivers minute-level data freshness in our production environment. We hope this article inspires your own production practice. The contents include:

  1. Background
  2. CDC and the data lake
  3. Technical challenges
  4. Effect
  5. Future plans
  6. Summary

1、 Background

As a customer data platform (CDP), Linkflow provides enterprises with a closed operation loop from customer data collection and analysis to execution. Every day, a large amount of data is collected through first-party collection endpoints (SDKs) and third-party data sources such as WeChat and Weibo. These data are cleaned, computed, integrated and written to storage. Users can analyze the persisted data through flexible reports or labels, and the results serve as the data source of the MA (Marketing Automation) system, enabling precise marketing for specific audiences.


In Linkflow, data is divided into immutable data and mutable data, and both participate in analysis; about a dozen tables are involved. The volume of immutable data is large and can reach billions of rows. In a traditional big data system, immutable data would be the fact data and mutable data the dimension data. In real business practice, however, users' natural attributes and the amounts and statuses of orders are all updatable, and the volume of such data is often considerable; in our system it reaches the hundred-million level. Mutable data has always been managed in the relational database MySQL: first, data maintenance is convenient, and second, business integration is easy.

But the problems are also obvious:

  • Data fragmentation. Because online DDL on large MySQL tables is risky, new sub-tables are often added to extend business attributes as business complexity grows, so a complete user record ends up scattered across multiple tables, which is very unfriendly to queries.
  • Multi-dimensional queries cannot be implemented. Multi-dimensional query is not the strength of a relational database, and indexing every field is unrealistic, so a data component with an OLAP query engine is needed to support multi-dimensional analysis scenarios. Considering the possibility of independent scaling in the future, we also gave priority to an architecture that separates computing and storage.

2、 CDC and the data lake

CDC (change data capture) is a software design pattern for determining and tracking changed data so that action can be taken on the changes. In fact, as early as two years ago we had used Canal to replicate MySQL data into heterogeneous storage, but we had not realized it could be integrated with big data storage in this way. While using Canal we also ran into some performance problems, and the open source community was basically no longer maintaining it. Therefore, before starting the new architecture, we investigated Maxwell and Debezium, and happened to notice flink-cdc-connectors [1], an open source project from Ververica, the company behind Flink. It embeds Debezium into Flink tasks as a binlog synchronization engine, which makes it easy to filter, validate, integrate and format-convert binlog messages inside the streaming task, and its performance is excellent. Considering that in the future we could join directly with behavioral data and even do simple risk control through CEP, we finally chose the Flink CDC scheme with embedded Debezium.

Because MySQL contains many data subjects, we also do data routing in the streaming task: change data for different subjects is routed to different Kafka topics, with Kafka acting as the ODS layer. This has many advantages. First, for mutable data we can clearly observe every change. Second, we can replay the data; the final state is the superposition of successive changes.
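For illustration, here is a minimal sketch of such a routing job, assuming the flink-cdc-connectors 1.x API and the Flink Kafka producer. Hostnames, database and table names, topic names and the topicFor helper are placeholders of our own, not the production code.

// Sketch: read the MySQL binlog via Flink CDC and route each change record to a Kafka topic per table.
import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;
import com.alibaba.ververica.cdc.debezium.DebeziumSourceFunction;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class CdcRoutingJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DebeziumSourceFunction<String> source = MySQLSource.<String>builder()
        .hostname("mysql-host").port(3306)          // placeholders
        .databaseList("crm")
        .tableList("crm.user", "crm.order")
        .username("cdc_user").password("******")
        .deserializer(new StringDebeziumDeserializationSchema())
        .build();

    Properties kafkaProps = new Properties();
    kafkaProps.setProperty("bootstrap.servers", "kafka:9092");

    // Choose the target topic per record, e.g. by the table name carried in the Debezium message.
    KafkaSerializationSchema<String> router = new KafkaSerializationSchema<String>() {
      @Override
      public ProducerRecord<byte[], byte[]> serialize(String value, Long timestamp) {
        return new ProducerRecord<>(topicFor(value), value.getBytes(StandardCharsets.UTF_8));
      }
    };

    env.addSource(source)
        .addSink(new FlinkKafkaProducer<>("ods_default", router, kafkaProps,
            FlinkKafkaProducer.Semantic.AT_LEAST_ONCE));
    env.execute("mysql-cdc-to-kafka");
  }

  // Hypothetical helper: map the table name inside the Debezium JSON to an ODS topic.
  private static String topicFor(String debeziumJson) {
    return debeziumJson.contains("\"table\":\"order\"") ? "ods_order" : "ods_user";
  }
}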

The next thing to consider is where the data should be stored. Combined with the principle of "separation of computing and storage" mentioned above, this is exactly an advantage a data lake provides: a data lake is generally built on file-system-like storage (object storage or traditional HDFS), which fits our expectations. After comparing several data lake schemes, we chose Apache Hudi for the following reasons:

  • Hudi provides an upsert solution on HDFS, giving a usage experience similar to a relational database. It is very friendly to updatable data and matches the semantics of MySQL binlog.
  • Incremental queries make it easy to obtain the data changed in the last 30 minutes or the last day, which is very friendly to incremental offline computing tasks: instead of recomputing the full dataset, only the changed data needs to be processed, greatly saving machine resources and time.
  • Metadata can be synchronized to Hive in near real time, creating the conditions for "queryable as soon as it enters the lake".
  • COW and MOR cater to, and are optimized for, two different usage scenarios.
  • The Hudi community is open and iterates quickly. During incubation it was integrated by AWS EMR, and later by Alibaba Cloud DLA data lake analysis [2], Alibaba Cloud EMR [3] and Tencent Cloud EMR [4], so the prospects are good. At the same time, the discussion in the Apache Hudi Chinese community group is very active, and more and more companies in China are building data lakes on Hudi.

After integrating Hudi, our architecture evolved accordingly.

We chose COW (copy on write) mode for the data tables, mainly because our workload is read-heavy and write-light and we want queries to be as fast as possible, while the query-side performance of the MOR (merge on read) strategy is still slightly weaker. In addition, we have no sub-second latency requirement for the data, so COW was the final choice.

At the top layer we use Presto as the analysis engine to provide ad hoc query capability. Since the Hudi version we use is 0.6.0, whose Flink integration had not yet been released, we had to adopt a Flink + Spark dual-engine strategy and use Spark Streaming to write the data from Kafka into Hudi.

3、 Technical challenges

After the POC, we settled on the architecture design shown above, but we encountered many challenges during the actual implementation.

3.1 Customization of the CDC operation mode

■ Full (snapshot) mode

A major advantage of Debezium is "batch-stream unification": in the snapshot stage it replays the full table scan into messages with the same format as the binlog incremental messages, so users can process full and incremental data with the same code. However, in our practice, if there are many historical tables with a lot of data in them, the snapshot phase lasts a very long time. Once the process is interrupted unexpectedly, the next run has to start scanning again from the first table. Assuming a complete snapshot takes several days, a "retry" at this scale is unacceptable, so we need something like checkpoint-and-resume. Searching the official Debezium documentation, we found the snapshot.include.collection.list parameter.

An optional, comma-separated list of regular expressions that match names of schemas specified
 in table.include.list for which you want to take the snapshot.

Therefore, after a snapshot is interrupted, we can pass the remaining tables to be scanned through this parameter to achieve a "resume" capability. One thing to note, however, is that no matter how many times the snapshot phase is retried, the incremental binlog position must be the position recorded at the first snapshot, otherwise data will be lost. This brings another problem: if a resumed run continues until the snapshot completes, Debezium will automatically start incremental synchronization from the binlog position of this (not the first) snapshot, which is not the result we need. We need the task to terminate right after the snapshot completes.
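For illustration, a hedged sketch of the connector properties for such a resumed run; hosts, credentials and table names are placeholders, and only table.include.list, snapshot.include.collection.list and snapshot.mode are the point here.

// Sketch: resuming an interrupted snapshot. Only the tables that have NOT been snapshotted yet
// are listed in snapshot.include.collection.list; snapshot.mode=initial_only makes the connector
// stop after the snapshot instead of switching to binlog reading, as discussed below.
import java.util.Properties;

public class SnapshotResumeConfig {
  public static Properties build() {
    Properties props = new Properties();
    props.setProperty("database.hostname", "mysql-host");
    props.setProperty("database.port", "3306");
    props.setProperty("database.user", "cdc_user");
    props.setProperty("database.password", "******");
    // All tables the connector is responsible for.
    props.setProperty("table.include.list", "crm.user,crm.order,crm.event");
    // Remaining tables to snapshot after the interruption.
    props.setProperty("snapshot.include.collection.list", "crm.order,crm.event");
    // Take the snapshot only; do not continue reading the binlog afterwards.
    props.setProperty("snapshot.mode", "initial_only");
    return props;
  }
}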

I went through a lot of Debezium documentation without finding such a feature, but while browsing the source code I noticed there is a way.

/**
 * Perform a snapshot and then stop before attempting to read the binlog.
 */
INITIAL_ONLY("initial_only", true);

// MySqlConnectorTask.java
if (taskContext.isInitialSnapshotOnly()) {
    logger.warn("This connector will only perform a snapshot, and will stop after that completes.");
    chainedReaderBuilder.addReader(new BlockingReader("blocker",
            "Connector has completed all of its work but will continue in the running state. It can be shut down at any time."));
    chainedReaderBuilder
            .completionMessage("Connector configured to only perform snapshot, and snapshot completed successfully. Connector will terminate.");
}

That is, in initial_only mode Debezium uses a BlockingReader instead of a BinlogReader, blocking the thread so that no incremental consumption happens.

■ Incremental mode

If the task stops automatically after the snapshot ends, it needs to be restarted manually to continue incremental synchronization. At the same time, incremental mode needs to support specifying a MySQL binlog file and position. Debezium's own schema_only_recovery mode allows these parameters to be set manually.

// Manually specify the binlog file name and position from which incremental reading should start.
DebeziumOffset specificOffset = new DebeziumOffset();
Map<String, Object> sourceOffset = new HashMap<>();
sourceOffset.put("file", startupOptions.specificOffsetFile);
sourceOffset.put("pos", startupOptions.specificOffsetPos);
specificOffset.setSourceOffset(sourceOffset);

Since the version of ververica/flink-cdc-connectors we used earlier was 1.2.0, which did not expose Debezium's schema_only_recovery mode, we modified the relevant source code. Version 1.3.0 already supports it: the offset can be passed as a startup option through the MySQLSource builder.
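A minimal sketch of what this looks like, as we understand the 1.3.0 API; the binlog file name and position are placeholders, and the rest of the builder mirrors the routing sketch shown earlier.

// Sketch: resume incremental synchronization from the binlog position recorded when the
// first snapshot started, using the flink-cdc-connectors 1.3.0 startup options.
import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;
import com.alibaba.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.alibaba.ververica.cdc.debezium.DebeziumSourceFunction;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;

public class IncrementalFromOffset {
  public static DebeziumSourceFunction<String> buildSource() {
    return MySQLSource.<String>builder()
        .hostname("mysql-host").port(3306)          // placeholders
        .databaseList("crm")
        .tableList("crm.user", "crm.order")
        .username("cdc_user").password("******")
        // start from a specific binlog file and position instead of taking a snapshot
        .startupOptions(StartupOptions.specificOffset("mysql-bin.000042", 154))
        .deserializer(new StringDebeziumDeserializationSchema())
        .build();
  }
}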

3.2 Partial (patch) update

Here it is necessary to explain what an overwrite update is versus a partial update; this corresponds to RESTful semantics. PUT is an overwrite update: the caller must provide the complete resource object, and in theory, if PUT is used without the complete resource object, the missing fields should be cleared. PATCH corresponds to a partial (local) update: the caller only provides the fields to be updated rather than the complete resource object, which saves bandwidth.

Hudi only supports overwrite updates by default, but in our business the data reported by collection endpoints cannot contain the complete business object. For example, as a user's age grows, a report may contain only that one field:

{
  "id": 123,
  "ts": 1435290195610,
  "data": {
    "age": 25
  }
}

This requires finding the stored data for rowkey = 123, merging it with the content to be updated, and then writing it back. During the merge, a field is merged only if the incoming value is not empty. By default, Hudi uses the combineAndGetUpdateValue method of OverwriteWithLatestAvroPayload:

Simply overwrites storage with latest delta record

For forward compatibility, our data engineering colleague Karl added an OverwriteNonDefaultsWithLatestAvroPayload class, overriding combineAndGetUpdateValue to handle the above problem, and contributed it back to the community: [HUDI-1255] Add new Payload (OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage [5]. In fact there are many similar needs in the community, for example [HUDI-1160] Support update partial fields for CoW table [6], and we hope more developers will keep improving this capability.

Of course, there is a limitation: if you really want to set a field to a null value, you cannot use OverwriteNonDefaultsWithLatestAvroPayload.
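To make the idea concrete, here is a simplified sketch of such a payload. It is our own illustration of the merge logic, not the exact class contributed in HUDI-1255, and the class name is hypothetical; it treats null as "not provided", which is exactly the limitation mentioned above.

// Sketch: merge only non-null incoming fields over the stored record, in the spirit of
// OverwriteNonDefaultsWithLatestAvroPayload.
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
import org.apache.hudi.common.util.Option;

public class NonNullMergeAvroPayload extends OverwriteWithLatestAvroPayload {

  public NonNullMergeAvroPayload(GenericRecord record, Comparable orderingVal) {
    super(record, orderingVal);
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
      throws IOException {
    Option<IndexedRecord> incoming = getInsertValue(schema);
    if (!incoming.isPresent()) {
      return Option.empty();
    }
    GenericRecord incomingRecord = (GenericRecord) incoming.get();
    GenericRecord storedRecord = (GenericRecord) currentValue;
    // Copy each non-null incoming field onto the stored record; null fields keep their old values.
    for (Schema.Field field : schema.getFields()) {
      Object value = incomingRecord.get(field.name());
      if (value != null) {
        storedRecord.put(field.name(), value);
      }
    }
    return Option.of(storedRecord);
  }
}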

At the same time, we also extended the community's compaction strategy by adding a time-based compaction scheduling strategy, so that compaction can be triggered not only by the number of incremental commits but also by elapsed time. This work has also been contributed back to the community, see [HUDI-1381] Schedule compaction based on time elapsed [7], which provides greater flexibility for compacting within a specified time window.
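With that strategy in place, triggering compaction by elapsed time becomes a matter of write configuration. A hedged sketch follows; the option keys are the ones we have seen in HoodieCompactionConfig of recent Hudi versions and should be verified against the version in use, and the values are purely illustrative.

// Sketch: schedule inline compaction for a MOR table based on elapsed time
// instead of (or in addition to) the number of delta commits.
import java.util.HashMap;
import java.util.Map;

public class CompactionOptions {
  public static Map<String, String> timeBased() {
    Map<String, String> opts = new HashMap<>();
    opts.put("hoodie.compact.inline", "true");
    opts.put("hoodie.compact.inline.trigger.strategy", "TIME_ELAPSED");
    // compact when at least one hour has elapsed since the last compaction
    opts.put("hoodie.compact.inline.max.delta.seconds", "3600");
    return opts;
  }
}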

3.3 Merging data with the same rowkey within a batch

One characteristic of CDC is that it captures data changes in real time; for example, the status of an order may change several times within a few minutes. In addition, with Spark Streaming micro-batching there is a high probability of receiving many records with the same rowkey within one time window, each rowkey corresponding to several change records. Therefore, we merge records with the same rowkey within a batch inside the streaming task, similar in spirit to how Hudi uses bloom filters to decide whether a rowkey already exists. Special attention must be paid to ordering: the changes must be applied strictly in order of the ts time, otherwise an old version of the data will overwrite a newer one.
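A simplified sketch of the merge step is shown below, using a generic record type of our own for illustration; the production logic runs inside the Spark Streaming job rather than as standalone code.

// Sketch: within one micro-batch, group change records by rowkey, sort them by ts and overlay
// non-null fields so that only one merged record per rowkey is upserted into Hudi.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchMerger {

  // Minimal illustrative record: rowkey, change timestamp, and the changed fields only.
  public static class Change {
    public final String rowKey;
    public final long ts;
    public final Map<String, Object> fields;
    public Change(String rowKey, long ts, Map<String, Object> fields) {
      this.rowKey = rowKey; this.ts = ts; this.fields = fields;
    }
  }

  public static List<Change> merge(List<Change> batch) {
    Map<String, List<Change>> byKey = new HashMap<>();
    for (Change c : batch) {
      byKey.computeIfAbsent(c.rowKey, k -> new ArrayList<>()).add(c);
    }
    List<Change> merged = new ArrayList<>();
    for (Map.Entry<String, List<Change>> e : byKey.entrySet()) {
      List<Change> changes = e.getValue();
      // Strictly order by ts so that newer changes overlay older ones, never the reverse.
      changes.sort(Comparator.comparingLong((Change c) -> c.ts));
      Map<String, Object> fields = new HashMap<>();
      long latestTs = 0L;
      for (Change c : changes) {
        c.fields.forEach((k, v) -> { if (v != null) fields.put(k, v); });
        latestTs = c.ts;
      }
      merged.add(new Change(e.getKey(), latestTs, fields));
    }
    return merged;
  }
}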

3.4 Schema evolution

Due to business growth and the need for flexibility, schema evolution is a hard requirement. Hudi has taken this into account, as we learned from Hudi's wiki [8]:

What's Hudi's schema evolution story 

Hudi uses Avro as the internal canonical representation for records, primarily due to its nice schema compatibility & evolution[9] properties. This is a key aspect of having reliability in your ingestion or ETL pipelines. As long as the schema passed to Hudi (either explicitly in DeltaStreamer schema provider configs or implicitly by Spark Datasource's Dataset schemas) is backwards compatible (e.g no field deletes, only appending new fields to schema), Hudi will seamlessly handle read/write of old and new data and also keep the Hive schema up-to date.

Since the Avro format itself supports schema evolution, Hudi naturally supports it too.

Schema evolution can be roughly divided into four types (a small backward-compatibility check is sketched after this list):

  1. Backwards compatible: old data can be read with the new schema; if a field has no value, the default value is used. This is also the compatibility mode Hudi provides.
  2. Forwards compatible: new data can be read with the old schema; Avro ignores the newly added fields. For forward compatibility, deleted fields must have default values.
  3. Full compatible: both forward and backward compatible. For full compatibility, only add fields with default values and only remove fields that have default values.
  4. No compatibility checking: generally needed when a field's type must be changed forcibly; this requires a full data migration and is not recommended.
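As a small self-check, Avro's own compatibility API can confirm that appending a field with a default value is backward compatible. The schemas below are hypothetical examples, not our production schemas.

// Sketch: verify that adding a field with a default is backward compatible
// (old data written with oldSchema can be read with newSchema).
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionCheck {
  public static void main(String[] args) {
    Schema oldSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"user\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"age\",\"type\":[\"null\",\"int\"],\"default\":null}]}");
    Schema newSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"user\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"age\",\"type\":[\"null\",\"int\"],\"default\":null},"
            + "{\"name\":\"city\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    // newSchema as reader, oldSchema as writer -> backward compatible.
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(newSchema, oldSchema);
    System.out.println(result.getType()); // expected: COMPATIBLE
  }
}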

In production practice, we can meet field-extension requirements by modifying the schema. However, some problems emerge. For example, too many fields make single files large (exceeding 128 MB) and writing slow; in extreme cases, writing files with more than 1,000 columns takes hours. We are looking for optimization schemes, such as field recycling or vertical table splitting, to reduce the number of fields in a single file.

3.5 Exceptions caused by simultaneous query and write

This is a problem on the query side. When we used Presto to query the Hive table, we occasionally got an exception that a Hudi metadata file could not be found, which led to an NPE inside Hudi.

Error checking path :hdfs://hudipath/.hoodie_partition_metadata, under folder: hdfs://hudipath/event/202102; nested exception is java.sql.SQLException: Query failed (#20210309_031334_04606_fipir)

Based on this information, we suspected that the metadata was being modified while it was being queried. After asking the community for help, we changed the hoodiePathCache in HoodieROTablePathFilter to a thread-safe ConcurrentHashMap, repackaged hudi-hadoop-mr.jar and hudi-common.jar, replaced them in the presto/plugin/hive-hadoop2 directory, and restarted Presto. No NPE has been observed since.

4、 Effect

Let's review the assumptions we made about the data lake at the beginning of the architecture design:

  • Support mutable data.
  • Schema evolution is supported.
  • Computing and storage are separated, and multiple query engines are supported.
  • Support incremental view and time travel.

Hudi basically delivers all of these features. With the new architecture in place, data latency and offline processing performance have improved significantly compared with the previous system, as shown by:

  1. The real-time data write path is simplified. Update operations used to be cumbersome; now we basically do not need to care whether a record is an insert or an update during development, which greatly reduces developers' mental burden.
  2. The time from data entering the lake to being queryable is shortened. Although we adopted the COW table mode, actual tests show that the freshness from ingestion to query is quite good, basically at the minute level.
  3. Offline processing performance is improved. Based on Hudi's incremental view feature, daily offline tasks can easily fetch only the data changed in the past 24 hours, so the amount of data to process is much smaller and the processing time correspondingly shorter (a query sketch follows this list).
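For reference, an incremental pull with the Spark DataSource looks roughly like the sketch below; the table path and the begin instant are placeholders, and the option keys are the ones documented for the Hudi versions we have used.

// Sketch: read only the records committed to the Hudi table after a given instant time,
// so the daily offline job processes just the last 24 hours of changes.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalPull {
  public static Dataset<Row> lastDay(SparkSession spark) {
    return spark.read()
        .format("org.apache.hudi")
        .option("hoodie.datasource.query.type", "incremental")
        // commit instant from roughly 24 hours ago, e.g. taken from the table's timeline
        .option("hoodie.datasource.read.begin.instanttime", "20210308000000")
        .load("hdfs://hudipath/event");
  }
}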

5、 Future plans

5.1 Flink integration

The "forced" dual-engine strategy mentioned earlier has been a real pain: operations and development practices cannot be unified. So we have paid close attention to the progress of Hudi's official Flink integration. There is a new RFC-24: Hoodie Flink Writer Proposal [10], and Flink support has been deeply integrated in Hudi 0.8.0. We expect the Flink-integrated versions to greatly improve performance and allow us to unify the processing engine on Flink instead of running dual engines.

5.2 Concurrent writes

To guarantee metadata consistency, Hudi did not support concurrent writes to the same table before version 0.8.0. In practice, however, much of the data in the lake is not only produced in real time but also needs to come from offline computation. If some fields of a table are a direct reflection of CDC while other fields are the results of offline tasks, this creates a demand for parallel writes.

We currently avoid this in two ways:

Vertical table splitting, i.e. keeping the two sets of fields in separate tables: CDC data is written through Spark Streaming, and offline computation results are written to another table, avoiding concurrent writes.

Simulating CDC messages written back to Kafka: when tables cannot be split for query-performance reasons, offline computation results are simulated as CDC messages sent to Kafka and then written to Hudi through Spark Streaming. The downside is obvious: offline task results are reflected in the final storage with a long delay.

Recently released Hudi 0.8.0 adds a concurrent write mode based on optimistic locking with file-level conflict detection, which should meet the concurrent write requirements well. We will test it and see the effect.
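As we understand the 0.8.0 release, enabling it is mostly writer configuration. A hedged sketch with ZooKeeper as the lock provider follows; the option keys should be verified against the 0.8.0 documentation, and the ZooKeeper address, lock key and base path are placeholders.

// Sketch: writer options for Hudi 0.8.0 optimistic concurrency control with a ZooKeeper lock provider.
import java.util.HashMap;
import java.util.Map;

public class ConcurrentWriteOptions {
  public static Map<String, String> occ() {
    Map<String, String> opts = new HashMap<>();
    opts.put("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
    opts.put("hoodie.cleaner.policy.failed.writes", "LAZY");
    opts.put("hoodie.write.lock.provider",
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
    opts.put("hoodie.write.lock.zookeeper.url", "zk-host");
    opts.put("hoodie.write.lock.zookeeper.port", "2181");
    opts.put("hoodie.write.lock.zookeeper.lock_key", "event_table");
    opts.put("hoodie.write.lock.zookeeper.base_path", "/hudi/locks");
    return opts;
  }
}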

5.3 Performance optimization

We mentioned some problems above, such as large files and frequent GC. Overall, we find that the write bottleneck occurs mainly in two places.


■ Index

We currently use HoodieGlobalBloomIndex, which makes both index building and lookup take a long time. The official documentation lists three index implementations:

How does the Hudi indexing work & what are its benefits?

The indexing component is a key part of the Hudi writing and it maps a given recordKey to a fileGroup inside Hudi consistently. This enables faster identification of the file groups that are affected/dirtied by a given write operation.
Hudi supports a few options for indexing as below

• HoodieBloomIndex (default): Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
• HoodieGlobalBloomIndex: The default indexing only enforces uniqueness of a key inside a single partition i.e the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well for even very large datasets[11]. However, in some cases, it might be necessary instead to do the de-duping/enforce uniqueness across all partitions and the global bloom index does exactly that. If this is used, incoming records are compared to files across the entire dataset and ensure a recordKey is only present in one partition.
• HBaseIndex: Apache HBase is a key value store, typically found in close proximity to HDFS. You can also store the index inside HBase, which could be handy if you are already operating HBase.

You can implement your own index if you’d like, by subclassing the HoodieIndex class and configuring the index class name in configs.
After discussion with the community, we prefer to use HBaseIndex or a similar key-value store to manage the index.
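Switching the index type is a write-side configuration. A hedged sketch of what an HBaseIndex setup might look like is shown below; the keys follow the Hudi index configs we have seen, while the ZooKeeper quorum and table name are placeholders.

// Sketch: writer options selecting HBaseIndex instead of the (global) bloom index.
import java.util.HashMap;
import java.util.Map;

public class IndexOptions {
  public static Map<String, String> hbaseIndex() {
    Map<String, String> opts = new HashMap<>();
    opts.put("hoodie.index.type", "HBASE");
    opts.put("hoodie.index.hbase.zkquorum", "zk-host");
    opts.put("hoodie.index.hbase.zkport", "2181");
    opts.put("hoodie.index.hbase.table", "hudi_record_index");
    return opts;
  }
}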

■ Update

Besides the large-file problem, upsert speed is also tied to the nature of CDC: the update range of mutable data is unpredictable. In the extreme case where 1,000 records to be updated fall into 1,000 different files, update performance is hard to improve through code optimization alone; we can only add CPU resources and increase processing parallelism. We will work on several fronts:

  1. Parameter tuning, to see whether there is a way to balance the number and size of files (a sketch of the relevant file-sizing options follows this list).
  2. Trying MOR mode for some business tables. MOR first writes updates to log files and merges them into Parquet later, which in theory reduces how often Parquet files are rewritten.
  3. Discussing business trade-offs in exchange for better write speed.
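On the first point, the knobs we are looking at are Hudi's file-sizing options. A hedged sketch follows; the values are illustrative defaults, not our tuned settings.

// Sketch: Hudi COW file-sizing options that trade off file count against file size.
import java.util.HashMap;
import java.util.Map;

public class FileSizingOptions {
  public static Map<String, String> sizing() {
    Map<String, String> opts = new HashMap<>();
    // target max size of a base (parquet) file
    opts.put("hoodie.parquet.max.file.size", String.valueOf(120 * 1024 * 1024));
    // files below this size are considered "small" and receive new inserts first
    opts.put("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024));
    // average record size estimate used to pack inserts into files
    opts.put("hoodie.copyonwrite.record.size.estimate", "1024");
    return opts;
  }
}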

6、 Summary

There is still a lot to optimize in the future. We will continue to participate actively in the community, try new features, and bring users a better data service experience. Finally, thanks to the developers and community maintainers of flink-cdc-connectors and Apache Hudi.

Author: Dean, Chief Architect at Linkflow

Reference link

[1] flink-cdc-connectors: https://github.com/ververica/…
[2] Alibaba Cloud DLA data lake analysis: https://help.aliyun.com/docum…
[3] Alibaba Cloud EMR: https://help.aliyun.com/docum…
[4] Tencent Cloud EMR: https://cloud.tencent.com/doc…
[5] [HUDI-1255] Add new Payload(OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage: https://github.com/apache/hud…
[6] [HUDI-1160] Support update partial fields for CoW table: https://github.com/apache/hud…
[7] [HUDI-1381] Schedule compaction based on time elapsed: https://github.com/apache/hud…
[8] wiki: https://cwiki.apache.org/conf…
[9] schema compatibility & evolution: https://docs.confluent.io/cur…
[10] RFC – 24: Hoodie Flink Writer Proposal: https://cwiki.apache.org/conf…
[11] very large datasets: https://eng.uber.com/uber-big…