Baixin Bank's real-time data lake evolution based on Apache Hudi

Time: 2021-07-26

This article introduces the construction of Baixin Bank's real-time computing platform, the design and practice of building a real-time data lake on Hudi, and how Hudi is integrated and used. The contents include:

  1. Background
  2. Design and practice of Baixin Bank's real-time computing platform based on Flink
  3. Integration practice of Baixin Bank's real-time computing platform and real-time data lake
  4. The future of Baixin Bank's real-time data lake
  5. Summary

1、 Background

Baixin Bank, whose full name is "CITIC Baixin Bank Co., Ltd.", is the first direct bank approved as an independent legal entity. As the first state-controlled internet bank, Baixin Bank has higher requirements for data agility than the traditional financial industry.

Data agility requires not only accurate data, but also real-time delivery and secure transmission. To meet the bank's data agility requirements, the Big Data Department of Baixin Bank took on the task of building a real-time computing platform, ensuring that data can go online quickly, safely, and in a standardized way.

Benefiting from the continuous iteration of big data technology, stream-batch unification now rests on two well-known pillars: a "unified computing engine" and a "unified storage engine".

  • Flink, a leader in real-time big data computing, further strengthened its capability as a unified computing engine with the release of version 1.12;
  • At the same time, with the development of the data lake technology Hudi, the unified storage engine has also ushered in a new generation of technological change.

Building on the work of the Flink and Hudi communities, Baixin Bank built a real-time computing platform and integrated the Hudi-based real-time data lake into it. Combined with the bank's internal data governance approach, this achieves a data lake whose data is online in real time, safe and reliable, standardized, and agile.

2、 Design and practice of Baixin Bank's real-time computing platform based on Flink

1. Positioning of real-time computing platform

The real-time computing platform is a bank-wide platform developed in-house by the big data IaaS team. It is an enterprise-grade product that realizes "end-to-end" online processing of real-time data.

  • Its core functions include real-time collection, real-time computation, real-time warehousing, complex event processing, a rule engine, visual management, one-click configuration, self-service deployment, and real-time monitoring and alerting.
  • It currently supports scenarios such as the real-time data warehouse, breakpoint recall, intelligent risk control, a unified asset view, anti-fraud, and real-time feature-variable processing.
  • It serves many business lines in the bank, such as small and micro enterprises, credit, anti-fraud, consumer finance, finance, risk, and so on.

Up to now, more than 320 real-time tasks are running stably online, and the daily QPS of online tasks reaches about 1.7 million (170W).

2. Architecture of real-time computing platform

By function, the architecture of the real-time computing platform is divided into three main layers:

■ 1) Data acquisition layer

At present, the acquisition layer is mainly divided into two scenarios:

  • The first scenario is collecting the binlogs of the MySQL standby database into Kafka. Our bank's data acquisition scheme does not adopt the CDC solutions commonly used in the industry, such as Canal or Debezium, for two reasons:

    1. Our MySQL is an internal Baixin Bank build whose binlog protocol differs from the community version, so the existing solutions cannot reliably obtain our binlogs.
    
    2. The MySQL standby database that serves as our data source may switch between machine rooms at any time, which could cause collected data to be lost. To solve this, we developed the Databus project for reading MySQL binlogs, reimplemented the Databus logic as a Flink application, and deployed it on the YARN resource framework, making binlog extraction highly available with controllable resource usage.
  • The second scenario is integration with third-party applications. The third-party applications write data to Kafka in one of two ways:

    1. One way is based on the JSON schema protocol we defined:

    (UMF protocol: {"col_name": "", "umf_id": "", "umf_ts": "", "umf_op_": "i/u/d"})
    The protocol defines a "unique ID", a "timestamp", and an "operation type". With it, users can mark each message as an "insert", "update", or "delete", so that downstream consumers can process each message accordingly (a minimal handling sketch follows this list).

    2. In the other way, users write plain JSON data directly to Kafka without distinguishing operation types.
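
As an illustration of how a downstream consumer can use the UMF protocol, here is a minimal, hypothetical Flink function that parses a message and dispatches on its operation type. The field names follow the protocol above; the use of Jackson for parsing and the handler methods are our own assumptions, not the platform's actual code.

```java
// Hypothetical sketch: dispatch a UMF message by its operation type.
// Field names ("umf_id", "umf_ts", "umf_op_") follow the protocol above;
// Jackson-based parsing is an illustrative assumption.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class UmfDispatcher extends RichMapFunction<String, String> {
    private transient ObjectMapper mapper;

    @Override
    public void open(Configuration parameters) {
        mapper = new ObjectMapper();
    }

    @Override
    public String map(String rawMessage) throws Exception {
        JsonNode umf = mapper.readTree(rawMessage);
        String op = umf.get("umf_op_").asText();   // "i", "u" or "d"
        String id = umf.get("umf_id").asText();    // unique record ID
        long ts = umf.get("umf_ts").asLong();      // event timestamp

        switch (op) {
            case "i": return applyInsert(id, ts, umf);
            case "u": return applyUpdate(id, ts, umf);
            case "d": return applyDelete(id, ts);
            default:  throw new IllegalArgumentException("Unknown op: " + op);
        }
    }

    // Downstream-specific handlers, elided here for brevity.
    private String applyInsert(String id, long ts, JsonNode umf) { return "+I:" + id; }
    private String applyUpdate(String id, long ts, JsonNode umf) { return "+U:" + id; }
    private String applyDelete(String id, long ts) { return "-D:" + id; }
}
```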

■ 2) Data computation and conversion layer

This layer applies a layer of conversion logic to the data consumed from Kafka: it supports user-defined functions, standardizes data, and desensitizes and encrypts sensitive fields.

■ 3) Data storage layer

Data is stored in HDFS, Kudu, TiDB, Kafka, Hudi, MySQL, and other storage media.

[Figure: overall architecture of the real-time computing platform]

As the architecture diagram above shows, the main functions supported by the overall real-time computing platform are:

  • Development level:

    1. It supports standardized Databus collection. This function handles the synchronization of MySQL binlogs to Kafka without user intervention; users only need to specify the MySQL source instance to get standardized synchronization into Kafka.
    2. It supports visual editing of Flink SQL.
    3. It supports user-defined Flink UDFs (a minimal UDF sketch follows this list).
    4. It supports complex event processing (CEP).
    5. It supports uploading, packaging, and compiling user Flink applications.
  • Operation and maintenance level:

    1. Support for state management and savepoints across different task types.
    2. Support for end-to-end latency monitoring and alerting.
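
As a concrete illustration of the UDF support (and of the desensitization done in the computation layer), below is a minimal sketch of a Flink scalar UDF. The phone-masking rule is purely an assumed example, not an actual in-bank function.

```java
// Hypothetical Flink scalar UDF performing simple desensitization.
// The masking rule is illustrative; real rules would follow the bank's
// data security standards.
import org.apache.flink.table.functions.ScalarFunction;

public class MaskPhone extends ScalarFunction {
    public String eval(String phone) {
        if (phone == null || phone.length() < 8) {
            return phone;  // leave null or short values unchanged
        }
        // Keep the first 3 and last 4 digits, mask everything in between.
        return phone.substring(0, 3) + "****" + phone.substring(phone.length() - 4);
    }
}
```

Once registered (for example via `CREATE FUNCTION mask_phone AS 'com.example.MaskPhone'` in Flink SQL), such a function can be called directly from the visually edited SQL.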

As the real-time computing platform iterates, community Flink versions introduce some backward incompatibilities. To upgrade Flink smoothly, we abstracted the computing engine into per-version modules and strictly isolate the versions at the JVM level, so that no JAR conflicts or Flink API incompatibilities can arise between versions.

[Figure: JVM-level isolation of multiple Flink engine versions]

As shown in the figure above, we package each Flink version independently and start it in its own JVM with its own Thrift server, so every Flink version has an independent Thrift server. At run time, as long as the user explicitly specifies a Flink version, the Flink application is launched by the corresponding Thrift server. We also embed a commonly used Flink version directly into the real-time computing back-end service, so that jobs on it avoid the extra startup time of launching a separate Thrift server.
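
Conceptually, the version routing looks like the sketch below. Everything here (endpoint addresses, the `ThriftSubmitClient` stand-in) is a hypothetical placeholder to show the idea of one long-lived JVM per engine version, not the platform's actual code.

```java
import java.util.Map;

public class EngineRouter {
    /** Hypothetical stand-in for a Thrift client talking to a per-version JVM. */
    static class ThriftSubmitClient {
        private final String endpoint;
        ThriftSubmitClient(String endpoint) { this.endpoint = endpoint; }
        void submitJob(String jobSpec) {
            // Real code would issue a Thrift RPC; we just log the routing decision.
            System.out.println("submitting to " + endpoint + ": " + jobSpec);
        }
    }

    // One long-lived Thrift endpoint per Flink version, each in its own JVM,
    // so jars of different Flink versions never share a classpath.
    private static final Map<String, String> VERSION_TO_ENDPOINT = Map.of(
            "1.9",  "thrift://flink-1-9-server:9090",
            "1.12", "thrift://flink-1-12-server:9090");

    public static void main(String[] args) {
        String version = "1.12";  // explicitly specified by the user
        String endpoint = VERSION_TO_ENDPOINT.get(version);
        if (endpoint == null) {
            throw new IllegalArgumentException("Unsupported Flink version: " + version);
        }
        new ThriftSubmitClient(endpoint).submitJob("my-flink-sql-job");
    }
}
```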

At the same time, to meet the financial industry's requirements for high availability and multi-site standby, the real-time computing platform also supports multiple Hadoop clusters, allowing real-time tasks to migrate to a standby cluster after a failure. The overall scheme supports checkpoints and savepoints across clusters: once a task fails, it can be restarted in the standby machine room.
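
A rough sketch of such a standby-room restart follows, under our own assumptions: checkpoints are replicated so that the standby HDFS holds a usable copy, and the job is resubmitted with `execution.savepoint.path` pointing at the latest one (whether that key is picked up automatically depends on the deployment mode).

```java
// Illustrative sketch of restarting a job in a standby cluster from the
// last checkpoint replicated to the standby HDFS. Paths and the way the
// latest checkpoint is located are assumptions, not the platform's code.
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StandbyRestart {
    public static void main(String[] args) throws Exception {
        // Assumed: checkpoints are written or replicated to both clusters,
        // so the standby HDFS namespace holds a usable copy of the latest one.
        String latestCheckpoint = "hdfs://standby-ns/flink/checkpoints/job-42/chk-118";

        Configuration conf = new Configuration();
        conf.setString("execution.savepoint.path", latestCheckpoint);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... rebuild the same job topology here, then call env.execute(...).
    }
}
```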

3、 Integration practice of Baixin Bank's real-time computing platform and real-time data lake

Before going into the details, let's first look at the current state of our bank's data lake. In the current real-time data lake, BQD still adopts the mainstream Lambda architecture to build the data warehouse.

[Figure: the existing Lambda architecture]

1. Lambda

Disadvantages of the data warehouse under the Lambda architecture:

  • Two sets of code for the same requirement: both batch and streaming logic must be developed and maintained, and the logic that merges their results must be maintained and deployed as well;
  • Higher compute and storage usage: the same logic is computed twice, increasing overall resource consumption;
  • Two versions of the truth: with two sets of computation logic, real-time and batch results are often inconsistent, and it is hard to tell which is accurate;
  • Limited reusability of the Kafka message queue: Kafka usually retains data only for days or months, so it cannot keep the full history, nor can it be analyzed by existing ad-hoc query engines.

2. Hudi

To solve the pain points of the Lambda architecture, we planned a new-generation data lake architecture. After spending considerable time investigating existing data lake technologies, we chose Hudi as our storage engine for the following reasons:

  • Update/delete support: Hudi uses fine-grained file- and record-level indexes to support updating and deleting records, and provides transactional guarantees (ACID semantics) for writes. Queries process the last committed snapshot and produce results based on it;
  • Change streams: Hudi has first-class support for obtaining data changes. It can produce an incremental stream of all records updated/inserted/deleted in a given table after a given point in time, and can query the table's state as of different times;
  • Query engine compatibility: Hudi works with our existing Presto query engine and our existing Spark technology;
  • Fast community iteration: Flink already supports reading and writing both table types, COW and MOR (a hedged example follows this list).
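
To make the change-stream point concrete, below is a minimal sketch of reading a Hudi table incrementally with Flink SQL through the Java Table API. The path, schema, and start commit are illustrative assumptions, and option names vary slightly across Hudi versions (newer releases rename `read.streaming.start-commit` to `read.start-commit`).

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiIncrementalReadSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hypothetical table: path, schema and options are illustrative only.
        tEnv.executeSql(
            "CREATE TABLE datalake_user (" +
            "  id BIGINT PRIMARY KEY NOT ENFORCED," +
            "  name STRING," +
            "  ts TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///datalake/user'," +
            "  'table.type' = 'MERGE_ON_READ'," +
            "  'read.streaming.enabled' = 'true'," +               // consume the table as a change stream
            "  'read.streaming.start-commit' = '20210701000000'" + // incremental read from this instant
            ")");

        // Every update/insert/delete after the start commit arrives incrementally.
        tEnv.executeSql("SELECT * FROM datalake_user").print();
    }
}
```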

[Figure: the new Hudi-based data lake architecture]

In the new architecture, all data from the real-time and batch source-aligned layers is written to Hudi storage, forming a new data lake layer, datalake (a Hive database). For historical reasons, and to stay compatible with the previous data warehouse model, we retain the original ODS layer and keep the historical warehouse model unchanged; the ODS source-aligned layer, however, now obtains its data from the datalake layer.

[Figure: warehousing logic from datalake into the warehouse model]

  • First, for newly created tables, we use Flink on the real-time computing platform to write directly into datalake (the new source-aligned layer, stored in Hudi format). Data analysts and data scientists can use the datalake layer directly for data analysis and machine-learning modeling. If the data warehouse model needs datalake as its data source, one layer of ODS conversion logic is required, and this conversion splits into two cases:

    1. For an incremental model, users only need to run a snapshot query on the latest datalake partition and load the result into ODS.
    2. For a full-volume model, users merge the previous day's ODS snapshot with the latest snapshot query of datalake into a new full snapshot, load it into the current ODS partition, and repeat this day by day (a hedged SQL sketch follows below).

The reason for doing it this way is that the existing warehouse model does not need to be transformed; we only change the ODS data source to datalake, which is much more timely. At the same time, it satisfies the demand of data analysts and data scientists to obtain data in near real time.
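
To make the full-volume case concrete, here is a hedged sketch of the daily merge as Flink batch SQL submitted through the Java Table API. All table, column, and partition names (ods.user, datalake.user, id, ts, dt) are illustrative assumptions, and a production job would also apply deletes according to the operation type.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OdsFullMergeSketch {
    public static void main(String[] args) {
        // Daily batch job; assumes a Hive catalog with ods.user (partitioned
        // by dt) and the Hudi-backed datalake.user already registered.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        tEnv.executeSql(
            "INSERT OVERWRITE ods.`user` PARTITION (dt = '2021-07-26') " +
            "SELECT id, name, ts FROM ( " +
            "  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn FROM ( " +
            "    SELECT id, name, ts FROM ods.`user` WHERE dt = '2021-07-25' " + // yesterday's full snapshot
            "    UNION ALL " +
            "    SELECT id, name, ts FROM datalake.`user` " +                    // latest datalake snapshot
            "  ) unioned " +
            ") deduped WHERE rn = 1");                                           // newest version per key wins
    }
}
```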

  • In addition, for existing ODS data, we developed a script that initializes the ODS layer data into datalake as a one-off:

    1. If the ODS layer holds full daily snapshots, we initialize only the latest snapshot into the corresponding datalake partition and then switch the table onto the real-time datalake ingestion link;
    2. If the ODS layer data is incremental, we skip initialization for now, build only a real-time ingestion link into datalake, and then switch the daily increments over to ODS day by day.
  • Finally, for data that enters the lake as a one-off, we can import it into datalake with a batch lake-ingestion tool (see the sketch below).
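
As an illustration of such one-off batch ingestion, the sketch below uses the `bulk_insert` write operation of Hudi's Flink integration, which skips per-record index lookups and suits initial loads. The table layout, path, and the assumption that the historical ODS table is registered in the catalog are ours.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BatchLakeIngestSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // Hypothetical Hudi sink for a one-off import; bulk_insert avoids
        // index maintenance overhead, which suits initial loads.
        tEnv.executeSql(
            "CREATE TABLE datalake_user_init (" +
            "  id BIGINT PRIMARY KEY NOT ENFORCED," +
            "  name STRING," +
            "  ts TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///datalake/user'," +
            "  'table.type' = 'COPY_ON_WRITE'," +
            "  'write.operation' = 'bulk_insert'" +
            ")");

        // Assumed source: the historical ODS table registered in the Hive catalog.
        tEnv.executeSql("INSERT INTO datalake_user_init SELECT id, name, ts FROM ods.`user`");
    }
}
```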

The overall lake-to-warehouse conversion logic is shown in the following diagram:

[Figure: overall lake-to-warehouse conversion logic]

3. Technical challenges

  • At the beginning of our research, Hudi's support for Flink was not yet mature, so we did a lot of development and testing on Spark Structured Streaming. From our POC results:

    1. With non-partitioned COW writes, writes became slower and slower once the volume reached tens of millions of records;
    2. After we switched from non-partitioned to incremental partitioned writes, the speed improved considerably.

The root cause is that Spark reads the base file index when writing; the larger the base files, the slower the index read, so writes become progressively slower.

  • Meanwhile, as Flink's support for Hudi improved, our goal became to integrate Hudi lake ingestion into the real-time computing platform. We therefore integrated and tested Hudi on the platform and ran into some problems along the way. Typical ones were:

    1. Class conflicts
    2. Class files that cannot be found
    3. RocksDB conflicts

To resolve these incompatibilities, we built a standalone module around Hudi's dependencies; the project simply repackages Hudi's dependencies into a shaded JAR.

  • When dependency conflicts arise, we exclude the conflicting Flink-related or Hudi-related dependencies from the module.
  • When other dependency packages cannot be found, we add the required dependencies through the POM file.
The Hudi-on-Flink scheme also had its own problems, for example checkpoints failing because they grew too large and took too long. To solve this, we set a TTL on state, switched from full checkpoints to incremental checkpoints, and increased the parallelism (a hedged configuration sketch follows).
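
Here is a sketch of that tuning with standard Flink APIs; the concrete numbers (intervals, TTL length, parallelism) are illustrative, and `EmbeddedRocksDBStateBackend` is the Flink 1.13+ name for the RocksDB backend.

```java
// Hedged sketch of the checkpoint tuning described above: incremental
// RocksDB checkpoints plus a TTL on keyed state. All numbers are
// illustrative assumptions, not our production settings.
import java.time.Duration;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental checkpoints: only RocksDB's new SST files are uploaded.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
        env.enableCheckpointing(Duration.ofMinutes(5).toMillis());
        env.getCheckpointConfig().setCheckpointTimeout(Duration.ofMinutes(20).toMillis());
        env.setParallelism(8);  // higher parallelism spreads state across subtasks

        // A TTL keeps keyed state from growing without bound.
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.days(1))
                .cleanupInRocksdbCompactFilter(1000)
                .build();
        ValueStateDescriptor<String> desc = new ValueStateDescriptor<>("dedup-key", String.class);
        desc.enableTimeToLive(ttl);
        // ... the descriptor would then be used inside a keyed function's open().
    }
}
```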
On choosing between COW and MOR: most of the Hudi tables we currently use are COW, for two reasons (an options sketch follows this discussion):

    1. First, our historical ODS stock data is imported into the datalake tables as a one-off, so there is no write amplification.
    2. Second, the COW workflow is relatively simple and involves no extra operations such as compaction.

For newly added datalake tables with a large number of updates and high real-time requirements, we prefer the MOR format; in particular, when QPS is high, we use asynchronous compaction to avoid write amplification. In all other cases, we prefer to write in COW format.
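
The choice ultimately comes down to the `table.type` option. The sketch below contrasts the two settings in Flink SQL via the Java Table API; paths, schemas, and compaction thresholds are illustrative assumptions, and option names can differ slightly between Hudi versions.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TableTypeSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // COW: simple workflow, no compaction; suits one-off imports
        // and tables with few updates.
        tEnv.executeSql(
            "CREATE TABLE ods_history_cow (id BIGINT PRIMARY KEY NOT ENFORCED, v STRING) WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///datalake/ods_history'," +
            "  'table.type' = 'COPY_ON_WRITE')");

        // MOR with async compaction: suits high-QPS, update-heavy,
        // latency-sensitive tables.
        tEnv.executeSql(
            "CREATE TABLE events_mor (id BIGINT PRIMARY KEY NOT ENFORCED, v STRING) WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///datalake/events'," +
            "  'table.type' = 'MERGE_ON_READ'," +
            "  'compaction.async.enabled' = 'true'," +
            "  'compaction.trigger.strategy' = 'num_commits'," +
            "  'compaction.delta_commits' = '5')");
    }
}
```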

4、 The future of Baixin Bank's real-time data lake

Within our bank's real-time data lake architecture, the goal is to build the entire real-time data warehouse pipeline on Hudi. The target architecture is shown in the figure:

[Figure: target architecture of the real-time data warehouse built on Hudi]

Our overall goal is to replace Kafka with Hudi as the intermediate storage, build the data warehouse on Hudi, and use Flink as a unified stream-batch computing engine. The benefits are:

  • MQ no longer serves as the intermediate storage for the real-time data warehouse; Hudi lives on HDFS and can store massive datasets;
  • The intermediate layers of the real-time data warehouse can be queried directly by OLAP analysis engines;
  • Stream and batch are unified in the true sense, solving the T+1 data latency problem;
  • Schemas no longer need to be strictly defined up front when reading, and schema evolution is supported;
  • Primary-key indexes are supported, improving data query efficiency several times over, and ACID semantics guarantee that data is neither duplicated nor lost;
  • Hudi's timeline can retain more intermediate state data, giving stronger data completeness.

5、 Summary

This article has introduced the construction of Baixin Bank's real-time computing platform, the design and practice of building a real-time data lake on Hudi, and how Hudi is integrated and used.

We encountered some problems while adopting Hudi, and we sincerely thank the community members for their help, especially Danny Chan and leesf for answering our questions. Under this real-time data lake architecture, we are still exploring solutions for building our real-time data warehouse and unifying stream and batch processing.