Hudi origin analysis — deepnova developer community

Time: 2022-05-09

1. Overview

According to its official introduction, Hudi (Hadoop Upserts Deletes and Incrementals) is a data lake framework that provides transaction support, record-level updates/deletes, and change streams on the data lake. It was first developed at Uber in 2016, open sourced in 2017, and became a top-level Apache project in 2020. Starting from the background against which Hudi was born, this article works out what problems Hudi first appeared to solve.

2. Near real-time scenario requirements

With the development of big data technology, two mature computing models have gradually taken shape:
One is the batch processing model, whose technology stack is represented by Hadoop. It is characterized by large scale, high fault tolerance, and high latency, and is mainly used in offline large-scale analysis scenarios. The other is the stream processing model, whose technology stack is represented by frameworks such as Storm and Flink. It is characterized by very low latency and is mainly used in real-time scenarios that demand it. Together, these two models cover most big data application scenarios.

However, there is a fuzzy boundary between stream processing and batch processing: requirements whose latency falls in the range of 5 minutes to 1 hour. Within this range either batch processing or stream processing technology can be used, and such requirements are called near real-time. A typical example is computing how certain dimensional metrics have changed over the past few minutes.
Such scenarios have the following three characteristics:
1. The required latency is at the sub-hour level.
2. The data comes from statistical analysis of business data (possibly involving multi-table joins).
3. The data may change within the business window.

3. Solutions with the traditional models

3.1. Using batch processing to solve near real-time problems
The batch processing model is oriented toward offline computing scenarios, where data first has to be imported from the business database into the data warehouse. In this stage, data is queried with select * from ... where condition; the condition is usually on a time dimension, and its choice, together with the startup cycle, also takes the impact on the business database into account. Typically a time window with little business activity in the early morning of the next day is chosen to pull the previous day's data. Once the import completes, the computation can be carried out.
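As an illustration, below is a minimal PySpark sketch of such a daily import job. The table name orders, the order_time column, and the connection details are assumptions made for the example, not details from the article.

```python
# A minimal sketch of the traditional batch import step, assuming a business
# table `orders` with an `order_time` column; names and connection details
# are illustrative only.
from datetime import date, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-batch-import").getOrCreate()

yesterday = date.today() - timedelta(days=1)
# The time-dimension condition described above: pull only yesterday's rows.
query = f"""
    (SELECT * FROM orders
     WHERE order_time >= '{yesterday}' AND order_time < '{date.today()}') AS t
"""

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://business-db:3306/shop")  # hypothetical source
      .option("dbtable", query)
      .option("user", "reader")
      .option("password", "******")
      .load())

# Land the pulled data in the warehouse as a dated partition for later computation.
df.write.mode("overwrite").parquet(f"/warehouse/ods/orders/dt={yesterday}")
```

Shrinking the schedule of a job like this from once a day to minutes is exactly what causes the problems discussed next.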

When the batch processing model is used to meet near real-time requirements, the scheduling cycle has to be shortened to somewhere between 5 minutes and 1 hour, which brings some problems.

3.1.1. Business database load

select * from ... where condition is a costly operation, so a normal batch job is run when business volume is low to avoid affecting the business. If, to support near real-time requirements, the task interval is shortened to minutes or an hour, it puts great pressure on the business database and may affect the business itself.

3.1.2. Data updates that cannot be ignored

This is again a problem with select: obtaining changed data through a SQL query requires additional conditions, and not every case allows the changed data to be obtained through select at all. In the traditional offline scenario the impact is relatively small, because the time range and startup cycle are basically 24 hours or more, most of the data will not change again, and the data volume at day granularity is large; even if a few individual records miss an update, the effect on the final result can be ignored.
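As an illustration of the "additional conditions" mentioned above, the usual approach relies on an audit column such as updated_at; the short sketch below assumes such a column exists (an assumption, since many business tables have none), and even then it cannot see rows that were physically deleted.

```python
# Hypothetical incremental pull via SELECT, assuming the application reliably
# maintains an `updated_at` audit column on the `orders` table.
from datetime import datetime, timedelta

last_run = datetime.now() - timedelta(hours=1)

# Captures inserts and updates since the last run; rows that were hard-deleted
# leave nothing behind for this query to find.
changed_rows_query = f"""
    SELECT * FROM orders
    WHERE updated_at >= '{last_run:%Y-%m-%d %H:%M:%S}'
"""
print(changed_rows_query)
```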

In the near real-time scenario, however, the shorter startup cycle means the probability of data changing within the window rises significantly while the total amount of data drops significantly, so the correctness of the final result can no longer be guaranteed. In the near real-time scenario, therefore, the fact that batch processing does not support data updates can no longer be ignored.

3.2. Using stream processing to solve near real-time problems

Stream processing mainly targets real-time computing scenarios with low latency requirements and generally handles event-driven workloads, so near real-time scenarios can in principle be handled directly with the stream processing model. However, because the event time windows of near real-time scenarios are between 5 minutes and 1 hour, processing them with stream processing technology costs considerably more than processing events with second-level time windows.

3.2.1. Greater memory consumption

When processing the update/delete statements in the business database's changelog, the operation has to be applied against the corresponding historical data, which requires loading a large amount of data into memory. Because the time window of a near real-time task is between 5 minutes and 1 hour rather than seconds, its memory footprint can be several times, or even tens of times, that of a second-level real-time task.

3.2.2. Peak-valley resource mismatch

Because the processing is real-time, the job has to run continuously, and it occupies more resources than batch processing would require. When the business peak arrives, more resources are needed to support it; when the trough arrives, those resources may not be used efficiently. On a cloud platform the resources can at least be released, but in a self-built data center the extra resources are likely to be wasted.

3.3. Summary

Batch processing and stream processing each run into different problems when handling near real-time requirements. Batch processing has a data update problem that cannot be ignored, while stream processing is overkill (killing a chicken with a cattle cleaver, as the saying goes) and correspondingly more expensive. This inevitably forces a choice: can there be a processing model that fits the requirement exactly? Out of this pursuit, a new processing model was proposed.

4. Incremental processing model

4.1. Basic structure
The incremental processing model was proposed for near real-time scenarios. Its basic structure can be thought of as full data storage plus incremental data processing. In the data import phase, change data capture is used to bring data into the storage engine; in the analysis and computation phase, only the incremental data is queried to compute the results.
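As a preview of what this looks like in practice, here is a minimal PySpark sketch using the Apache Hudi Spark datasource for both halves of the model: upserting captured changes into storage, then querying only the increment. The paths, field names, and begin instant are assumptions for illustration; how Hudi supports this internally is the subject of the next article.

```python
# Minimal sketch of the incremental processing model on the Apache Hudi Spark
# datasource (requires the Hudi Spark bundle on the classpath):
#   (1) data import: apply captured change records as an upsert;
#   (2) analysis: read back only the records changed since a given commit.
# Table, field names, paths, and the begin instant are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-model-sketch").getOrCreate()

base_path = "/warehouse/hudi/orders"
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",      # primary key
    "hoodie.datasource.write.partitionpath.field": "dt",        # partition column
    "hoodie.datasource.write.precombine.field": "updated_at",   # keep the latest version
    "hoodie.datasource.write.operation": "upsert",               # apply changes in place
}

# (1) Import phase: `changes` stands in for the captured change stream.
changes = spark.createDataFrame(
    [(1, "paid", "2022-05-09 10:00:00", "2022-05-09")],
    ["order_id", "status", "updated_at", "dt"],
)
changes.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

# (2) Analysis phase: query only the increment since an earlier commit time.
increment = (spark.read.format("hudi")
             .option("hoodie.datasource.query.type", "incremental")
             .option("hoodie.datasource.read.begin.instanttime", "20220509000000")
             .load(base_path))
increment.createOrReplaceTempView("orders_increment")
spark.sql("SELECT status, COUNT(*) AS cnt FROM orders_increment GROUP BY status").show()
```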

4.2. Differences from the traditional processing models

4.2.1. Difference from batch processing

Compared with batch processing, the storage layer of the incremental processing model supports update/delete. This expands the way batch processing can acquire data, from only querying data to both querying data and processing a changelog.

4.2.2. Difference from stream processing

Compared with stream processing, the data model is essentially still batch: data processing requires periodic import and extraction for computation, and the data has to travel from memory to disk and then back to memory.

4.3. Advantages of the incremental processing model in near real-time scenarios

Since the incremental processing model was proposed specifically for near real-time scenarios, what advantages does incremental processing have over the two traditional processing models?

4.3.1. Ability to cope with data changes

By introducing a storage engine that supports update/delete, the incremental processing model gains the ability to process a changelog. When facing the changing data flow of a near real-time scenario, it no longer has the problem of being unable to handle data changes.

4.3.2. Reducing the impact on the business database by processing the changelog

Capturing the changelog instead of running select queries reduces the impact on the business database. A production-grade business database (OLTP) already has to generate, process, flush, archive, and synchronize its logs to replicas to guarantee production-level HA, so obtaining the changelog only requires an additional thread to read it. In addition, the changelog usually only needs one extract-load (EL) job to land it in Kafka as a buffer, and the EL job of the incremental model then only has to consume the changelog from Kafka. Compared with irregularly querying historical data from the business database, the load is greatly reduced.
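On the consuming side, a micro-batch style job can drain whatever changelog records have accumulated in Kafka and then exit. The sketch below uses Spark Structured Streaming's run-once trigger; the topic, servers, and paths are assumptions, and for simplicity it lands the raw records as files rather than upserting them into the storage engine as in the earlier sketch.

```python
# Hypothetical EL job on the consuming side: read the changelog buffered in
# Kafka in a single micro-batch and land it in the warehouse, then exit so
# the resources can be given to other jobs. Requires the spark-sql-kafka
# connector; topic, servers, and paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("changelog-microbatch").getOrCreate()

changelog = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "kafka:9092")
             .option("subscribe", "orders-changelog")
             .option("startingOffsets", "earliest")
             .load()
             .select(col("key").cast("string"), col("value").cast("string")))

# trigger(once=True): process everything currently available, then stop.
# This is a batch-style run over a streaming source instead of a resident job.
query = (changelog.writeStream
         .format("parquet")
         .option("path", "/warehouse/staging/orders_changelog")
         .option("checkpointLocation", "/warehouse/checkpoints/orders_changelog")
         .trigger(once=True)
         .start())
query.awaitTermination()
```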

4.3.3. Reducing memory usage through batch-style processing

The incremental processing model writes the changelog into the storage engine and queries the data to be analyzed from that engine. Compared with stream processing, it does not need to keep the full historical state resident in memory.

4.3.4. More flexible scheduling through micro-batch computing

Micro-batch means that the data import and computation jobs do not need to run continuously like stream processing jobs, which leaves more room for scheduling. On the same computing resources, with the help of a reasonable scheduling system, more computation jobs can be accommodated and run, saving costs for the enterprise.

4.4. Summary

The incremental model can be seen as either a special kind of batch or a special kind of stream. Compared with a stream, it is really a batch job running periodically; compared with a batch, it has the ability to handle change events. Its advantage in meeting near real-time requirements is essentially a balance between benefit and cost.

5. Summary

This article introduced the problem Hudi needs to solve, the requirements of near real-time scenarios, and compared the shortcomings of the two mature processing models in the big data field, stream processing and batch processing, in that setting. Finally, it introduced a new approach: the incremental processing model.

The incremental processing model ingests data in the form of CDC and performs computation and analysis in micro-batches, balancing latency against cost. To cope with data changes, support for data update/delete has to be introduced, and it is not hard to see that the key is to provide update/delete capability on top of the traditional big data storage engine. What specific work has Hudi done to support incremental processing? Please look out for the next article in this series.

6. References

  1. Uber’s case for incremental processing on Hadoop – O’Reilly (oreilly.com)
  2. Uber’s Big Data Platform: 100+ Petabytes with Minute Latency – Uber Engineering Blog
  3. Apache Hudi – The Data Lake Platform | Apache Hudi
  4. Big data stream processing architecture – Interpretation of Dipu technology Fastdata series