Multi-stream splicing practice based on Apache Hudi

Date: 2022-05-25

In the business scenario of building wide tables for a real-time data warehouse, the ByteDance data lake team has explored and practiced a new solution based on Hudi's payload merge mechanism.

The solution provides the ability to associate multi-stream data at the storage layer and aims to solve a series of problems encountered by multi-stream joins in real-time scenarios. This article introduces the background of the multi-stream splicing scheme and our practical experience with it in detail.

Business challenges

Many business scenarios at ByteDance need to build a large wide table in real time from multiple data sources that share the same primary key. The data sources are typically metric data in Kafka and dimension data in KV stores.

Business teams usually produce this wide table by joining the data sources on the stream with a real-time computing engine, but that approach faces many challenges in practice, which fall into the following two cases:

  1. Dimension table join
  • Scenario challenge: metric data is joined with dimension data. The dimension data is large and the metric data arrives at high QPS, which can delay the output.

  • Current approach: cache part of the dimension data to relieve the backpressure caused by hitting the dimension-data storage engine at high QPS.

  • Remaining problems: because of the large time gap between the dimension data and the metric data, the metric data stream cannot be given a reasonable TTL; moreover, the cached dimension data is not refreshed in time, resulting in inaccurate downstream data.

  2. Multi-stream join
  • Scenario challenge: multiple metric streams are joined, and different metrics may arrive with large time differences.
  • Current approach: use a window-based join and maintain a relatively large state.
  • Remaining problems: maintaining a large state not only puts pressure on memory, it also lengthens checkpoint and restore times, which can lead to task backpressure.

Analysis and Countermeasures

The main challenges can be summarized as follows:

  • Because of the large time differences between streams, a large state must be maintained, and a suitable TTL is hard to set.
  • Because dimension data is cached, it is not updated in time, resulting in inaccurate downstream data.

Given these problems, and given that the business scenario tolerates some data delay but requires high data accuracy, we explored and refined in practice a multi-stream splicing scheme based on Hudi's payload mechanism:

  • Multi-stream data is spliced entirely in the storage layer, independent of the computing engine, so no join state (and therefore no TTL) needs to be maintained.
  • Dimension data and metric data are updated independently as separate streams. The streams are not merged while writing; they are merged when read downstream, so dimension data no longer needs to be cached. The merge can also be performed during compaction to speed up downstream queries.

In addition, the multi-stream splicing scheme also supports:

  • Built-in generic templates with common interfaces such as data deduplication, which can also satisfy users' customized data-processing needs.
  • Offline scenarios and mixed stream-batch scenarios.

Scheme introduction

Basic concepts

First, a brief introduction to the core Hudi concepts that the scheme relies on:

  • Hudi MetaStore

This is a centralized data lake metadata management system. It implements concurrent write control based on optimistic locking over the timeline and supports column-level conflict checking, which is essential for concurrent writes in the Hudi multi-stream splicing scheme. For more details, please refer to RFC-36, contributed by the ByteDance data lake team to the community.
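The Hudi MetaStore above is the component described in RFC-36 and is not part of every Hudi distribution. For orientation only, the sketch below shows how open-source Hudi enables table-level optimistic concurrency control through write options; the ZooKeeper address, lock key, and base path are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

public class ConcurrentWriteOptions {
    // Minimal sketch of open-source Hudi's optimistic concurrency control (OCC) options.
    // The Hudi MetaStore in this article (RFC-36) adds column-level conflict checks on top
    // of this kind of timeline-based locking; all values below are placeholders.
    public static Map<String, String> occOptions() {
        Map<String, String> opts = new HashMap<>();
        opts.put("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
        opts.put("hoodie.cleaner.policy.failed.writes", "LAZY");
        opts.put("hoodie.write.lock.provider",
                 "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
        opts.put("hoodie.write.lock.zookeeper.url", "zk-host");          // placeholder
        opts.put("hoodie.write.lock.zookeeper.port", "2181");
        opts.put("hoodie.write.lock.zookeeper.lock_key", "wide_table");  // placeholder
        opts.put("hoodie.write.lock.zookeeper.base_path", "/hudi/locks");
        return opts;
    }
}
```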

  • Merge-on-Read (MOR) table read/write logic

A Merge-on-Read table contains two kinds of files: log files (row-based) and base files (columnar), which makes it suitable for real-time, high-frequency update scenarios. Updates are written directly into log files and merged at read time. To reduce read amplification, log files are regularly merged into base files; this process is called compaction.
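For reference, a few standard open-source Hudi write options control when a MOR table's log files are compacted; a minimal sketch with illustrative values follows (in the setup described later in this article, the actual compaction execution is handed off to an independent service instead).

```java
import java.util.HashMap;
import java.util.Map;

public class MorCompactionOptions {
    // Minimal sketch of standard open-source Hudi options for a MOR table's compaction.
    // Values are illustrative; this article's scheme runs compaction in a separate service.
    public static Map<String, String> compactionOptions() {
        Map<String, String> opts = new HashMap<>();
        opts.put("hoodie.datasource.write.table.type", "MERGE_ON_READ");
        opts.put("hoodie.compact.inline", "false");                // do not compact inside the write job
        opts.put("hoodie.compact.inline.max.delta.commits", "5");  // trigger threshold in delta commits
        return opts;
    }
}
```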

Principle overview

For the business scenarios above, we designed a multi-stream splicing scheme implemented entirely in the storage layer. It supports concurrent writes from multiple data streams, merges the streams by primary key at read time, and also supports asynchronous compaction to speed up downstream reads.


Figure 1: Hudi multi-stream splicing concept diagram (the example data in all figures in this article is consistent with Figure 1)

The principle of the scheme is described with a simple example. Figure 1 is the schematic diagram of multi-stream splicing. The wide table in the figure contains five columns A, B, C, D, E and is spliced from two real-time streams and one offline stream. A is the primary key column; real-time stream 1 writes columns A, B, C; real-time stream 2 writes columns A, D; and the offline stream writes columns A, E. Only the splicing of the two real-time streams is described here.

Figure 1 shows that the data written by the two streams is stored as log files. The merge process combines the data in the log files with the base file: each column value present in the log files updates the corresponding column in the base file, while columns not touched by the log files keep their base-file values. In Figure 1, columns B, C, and D are updated to new values, and column E remains unchanged.

Write process

The multi-stream splicing scheme supports concurrent writes from multiple streams that are independent of one another. For a single stream, the logic is the same as Hudi's original write path: data is upserted into the Hudi table, stored as log files, and deduplicated during the write. In the multi-stream write scenario, the core question is how to handle concurrency.
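As a rough illustration of the single-stream write path, the sketch below upserts the columns owned by real-time stream 1 in Figure 1 (A, B, C) into a MOR table using Spark's Java API. The input path, table name and path, the `ts` precombine column, and the `org.example.PartialUpdatePayload` payload class are assumptions made for the example; the actual partial-column payload used in this scheme is not named in the article.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class Stream1Writer {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-multi-stream-sketch")
                .getOrCreate();

        // Stream 1 owns columns a (primary key), b and c of the wide table in Figure 1.
        Dataset<Row> stream1 = spark.read().json("/tmp/stream1_input");  // placeholder source

        stream1.write()
                .format("hudi")
                .option("hoodie.table.name", "wide_table")                 // placeholder name
                .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
                .option("hoodie.datasource.write.operation", "upsert")
                .option("hoodie.datasource.write.recordkey.field", "a")
                .option("hoodie.datasource.write.precombine.field", "ts")  // assumed event-time column
                // Hypothetical payload class standing in for the partial-column merge
                // described in this article (not the default OverwriteWithLatestAvroPayload).
                .option("hoodie.datasource.write.payload.class",
                        "org.example.PartialUpdatePayload")
                .mode(SaveMode.Append)
                .save("/warehouse/wide_table");                            // placeholder path
    }
}
```

Stream 2 would run the same kind of job over its own columns (A, D); the concurrency handling between the two jobs is described next.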

Figure 2 shows the flow of concurrent writes. Stream 1 and stream 2 are two concurrent tasks; we must check whether the columns they write intersect anywhere other than the primary key. For example:

The schema of stream 1 contains three columns (A, B, C), and the schema of stream 2 contains two columns (A, D).
When they write concurrently, Hudi MetaStore first checks the delta commits initiated by the two tasks for column conflicts, that is, whether the columns other than the primary key intersect, shown as (B, C) and (D) in the figure:

  • If there is an intersection, the delta commit initiated later fails.
  • If there is no intersection, both tasks continue writing.


Figure 2: Schematic diagram of the data write process
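A minimal sketch of the column-conflict rule in Figure 2, assuming the column sets of the two delta commits are already known; the class and method names are invented for illustration and are not Hudi MetaStore's actual API.

```java
import java.util.HashSet;
import java.util.Set;

public class ColumnConflictChecker {
    /**
     * Returns true when two concurrent delta commits may both proceed, i.e. the only
     * column the two writers share is the primary key. Illustrative only; the real
     * check happens inside Hudi MetaStore when the delta commits are initiated.
     */
    public static boolean canCommitConcurrently(Set<String> commit1Columns,
                                                Set<String> commit2Columns,
                                                String primaryKey) {
        Set<String> overlap = new HashSet<>(commit1Columns);
        overlap.retainAll(commit2Columns);
        overlap.remove(primaryKey);
        return overlap.isEmpty();  // any remaining overlap means the later commit must fail
    }

    public static void main(String[] args) {
        Set<String> stream1 = Set.of("a", "b", "c");  // stream 1 writes A, B, C
        Set<String> stream2 = Set.of("a", "d");       // stream 2 writes A, D
        System.out.println(canCommitConcurrently(stream1, stream2, "a"));  // prints true
    }
}
```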

Read process

Next, the core flow of a snapshot query in the multi-stream splicing scenario: the log files are deduplicated and merged first, and then the base file is merged with the deduplicated log-file data. Figure 3 shows the whole merge, which can be divided into the following two steps:

  • Merge LogFile
    Hudi's existing logic reads the data in the log files into a map: for each record in a log file, if its key is not in the map, it is put into the map directly; if the key is already in the map, the entry needs to be updated.

In multi-stream splicing, a log file contains data written by different streams, so the columns of each record may differ. When updating, we therefore need to check whether two records with the same key come from the same stream: if they do, the entry is simply updated; if not, the records are spliced.

As shown in Figure 3, when the record with primary key key1 is read from LogFile2, a record for key1 already exists in the map, but the two records come from different streams, so they are spliced into a new record (key1, b0_new, c0_new, d0_new) that is put back into the map.
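A simplified sketch of this map-based log merge, using plain column-name-to-value maps instead of Hudi's Avro records; the stream tag carried on each record is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class LogFileMerger {

    /** Simplified stand-in for a log record: which stream wrote it and its column values. */
    public static final class LogRecord {
        final String streamId;              // illustrative tag, not a stock Hudi field
        final Map<String, Object> columns;  // columnName -> value
        LogRecord(String streamId, Map<String, Object> columns) {
            this.streamId = streamId;
            this.columns = new HashMap<>(columns);
        }
    }

    // recordKey -> merged record; a stand-in for Hudi's in-memory record map.
    private final Map<String, LogRecord> merged = new HashMap<>();

    public void apply(String key, LogRecord incoming) {
        LogRecord existing = merged.get(key);
        if (existing == null) {
            merged.put(key, incoming);   // first time this key is seen
        } else if (existing.streamId.equals(incoming.streamId)) {
            merged.put(key, incoming);   // same stream: take the latest version of its columns
        } else {
            // Different streams: splice the two column sets into one wider record,
            // e.g. (key1, b0_new, c0_new) + (key1, d0_new) -> (key1, b0_new, c0_new, d0_new).
            Map<String, Object> spliced = new HashMap<>(existing.columns);
            spliced.putAll(incoming.columns);
            // Simplification: tag the spliced record with the latest writer.
            merged.put(key, new LogRecord(incoming.streamId, spliced));
        }
    }

    public Map<String, LogRecord> result() {
        return merged;
    }
}
```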

  • Merge BaseFile and LogFile

Hudi's existing default logic checks, for each record in the base file, whether the map contains a record with the same key; if so, the map record completely overwrites the base-file record. In multi-stream splicing, the map record does not completely overwrite the corresponding base-file record; it may update only some columns, namely the columns carried by the map record.

As shown in Figure 3, taking the simplest overwrite logic as an example: when the record with primary key key1 is read from the base file, key1 already exists in the map and its record carries values for columns B, C, and D, so those columns of the base-file record are updated, yielding a new record (key1, b0_new, c0_new, d0_new, e0). Column E was not updated, so it keeps its original value e0.
For a new key such as key3, columns B, C, and E must be filled with default values to form a complete record.


Figure 3: Data merge process in a snapshot query
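The partial-column overwrite shown in Figure 3 can be sketched with Avro generic records: a column is taken from the merged log record when it carries a value, and from the base file otherwise. This is only an approximation of the payload behaviour described above, assuming the log record has been projected onto the wide-table schema and that null means "not written by this stream".

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class PartialColumnMerge {
    /**
     * Merges one base-file record with the spliced log record of the same key:
     * a column is taken from the log record when it carries a value, otherwise the
     * base-file value is kept (so column E in Figure 3 stays e0). This mirrors the
     * behaviour described above; it is not the actual payload implementation.
     */
    public static GenericRecord merge(GenericRecord baseRecord,
                                      GenericRecord logRecord,
                                      Schema wideSchema) {
        GenericRecord result = new GenericData.Record(wideSchema);
        for (Schema.Field field : wideSchema.getFields()) {
            Object logValue = logRecord.get(field.name());   // null when the stream did not write it
            result.put(field.name(),
                       logValue != null ? logValue : baseRecord.get(field.name()));
        }
        return result;
    }
}
```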

Asynchronous compaction

To improve read performance, the write tasks of some data sources execute compaction synchronously. In practice, however, synchronous compaction blocks the write task, and compaction is resource-intensive, so it may preempt resources from the streaming ingestion task.

For such scenarios, compaction tasks and streaming ingestion tasks are isolated through an independent compaction service. Unlike the asynchronous compaction provided by Hudi itself, the user does not need to specify which compaction instant to execute; the independent compaction service is responsible for compacting all tables. The details of the compaction service are beyond the scope of this article; please refer to RFC-43.

Concretely, the streaming ingestion task synchronously generates (schedules) the compaction plan and stores it in Hudi MetaStore. An asynchronous compactor, independent of the streaming ingestion task, polls Hudi MetaStore for compaction plans in a loop and executes them.
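A schematic of that division of labour is sketched below; the `MetaStoreClient` interface, its methods, and the polling interval are invented for illustration, since the real compaction service (RFC-43) is not described in this article.

```java
import java.util.List;

public class AsyncCompactionLoop {

    /** Hypothetical client; the real service talks to Hudi MetaStore (see RFC-43). */
    interface MetaStoreClient {
        List<String> pendingCompactionPlans(String table);  // instants scheduled by ingestion jobs
        void runCompaction(String table, String instant);   // execute one compaction plan
    }

    private final MetaStoreClient client;

    AsyncCompactionLoop(MetaStoreClient client) {
        this.client = client;
    }

    /** Poll for plans scheduled by the streaming ingestion tasks and execute them. */
    public void run(String table) throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            for (String instant : client.pendingCompactionPlans(table)) {
                client.runCompaction(table, instant);  // compaction runs outside the ingestion job
            }
            Thread.sleep(60_000);                      // polling interval: an arbitrary choice
        }
    }
}
```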

Scenario practice and future planning

Finally, the Hudi multi-stream splicing scheme has been put into production on the DWS layer of the real-time data warehouse, where a single table supports concurrent ingestion of 3+ data streams and covers hundreds of TB of data.

In addition, when querying the wide-table data with Spark, queries that scan tens of TB run more than 200% faster than the equivalent multi-table joins, and some more complex queries also improve by 40-140%.

At present, the Hudi-based multi-stream splicing scheme is not yet easy to use: a single task requires more than 10 parameters to be configured. To further reduce the cost of use, we plan to add SQL syntax support for column-level insert and update and to consolidate the parameters.

To further improve query performance on the wide table, we also plan to support columnar log files in the multi-stream splicing scenario and to provide features such as column pruning and filter pushdown.

The data lake team is hiring. For more, follow the ByteDance Data Platform official account of the same name.

Related products

  • Volcano Engine Lakehouse Analytics Service (LAS)

A serverless data processing and analysis service for the lakehouse architecture, providing one-stop storage, computation, and interactive analysis of massive data. It is fully compatible with the Spark, Presto, and Flink ecosystems and helps enterprises gain insight from their data.

  • Volcano Engine E-MapReduce

Supports building enterprise-grade big data analysis systems on the open-source Hadoop ecosystem. Fully compatible with open source, it provides integrated management of Hadoop, Spark, Hive, and Flink, helping users easily build an enterprise big data platform, lowering the operations and maintenance threshold, and quickly establishing big data analysis capabilities.