Building an Enterprise Real-Time Data Lake with Flink + Iceberg


Apache Flink is a very popular unified stream and batch computing engine in the big data field, and the data lake is a new technical architecture that fits the trends of the cloud era. So what sparks fly when Apache Flink meets the data lake? This talk covers the following topics:

  1. Background of the data lake;
  2. Classic business scenarios;
  3. Why Apache Iceberg;
  4. How to stream data into the lake with Flink + Iceberg;
  5. Community future planning.

Background of the data lake

What is a data lake? Broadly speaking, an enterprise keeps all the data it generates on a single platform, and we call that platform a “data lake”.

As the picture of the lake illustrates, its data comes from many kinds of sources: some of it is structured, some unstructured, and some is even binary. One group of people stands at the inlet of the lake, testing the water quality with instruments; this corresponds to stream processing on the data lake. A number of pumps draw water out of the lake; this corresponds to batch processing. Another group of people is fishing from boats or from the shore; this corresponds to data scientists extracting value from the data lake through machine learning.

To sum up, a data lake has four main characteristics:

  1. It stores raw data from a rich variety of sources;
  2. It supports multiple computing models;
  3. It has comprehensive data management capabilities: it can ingest from multiple data sources, connect different datasets, and support schema management and permission management;
  4. Its underlying storage is flexible. Inexpensive distributed storage such as S3, OSS, or HDFS is typically used, together with specific file formats and caches to meet the data analysis needs of each scenario.

So what does an open source data lake architecture look like? The architecture diagram I drew is divided into four layers:

  1. At the bottom is the distributed file system. Cloud users mostly choose object storage such as S3 or OSS because it is much cheaper, while non-cloud users generally run their own HDFS.
  2. The second layer is the data acceleration layer. A data lake architecture completely separates storage from compute. If every data access has to read remotely from the file system, performance suffers and the cost is high. If frequently accessed hot data is cached on the local compute nodes, hot and cold data are separated naturally: local reads are fast, and remote access bandwidth is saved. For this layer we usually choose the open source Alluxio, or JindoFS on Alibaba Cloud.
  3. The third layer is the table format layer. It wraps a batch of data files into a table with business meaning and provides table-level semantics such as ACID, snapshots, schema, and partitioning. This layer corresponds to open source projects such as Delta, Iceberg, and Hudi. Some users think that Delta, Iceberg, and Hudi are the data lake. In fact, these projects are only one part of the data lake architecture; they give that impression because they sit closest to the user and hide many of the underlying details.
  4. The top layer consists of the computing engines for different computing scenarios. Open source engines generally include Spark, Flink, Hive, Presto, and Hive MR. These engines can all access the same data lake tables at the same time.

Classic business scenarios

So, in which classic scenarios can Flink be combined with a data lake? In the scenarios below we assume Apache Iceberg as the data lake table format; the reasons for this choice are explained in detail in a later section.

First of all, the most classic Flink + Iceberg scenario is building a real-time data pipeline. The large volume of log data generated by the business side is imported into a message queue such as Kafka. A Flink streaming job runs ETL on it and writes the result into a raw Apache Iceberg table. Some business scenarios run analysis jobs directly against the raw table, while others need the data to be cleaned further. In that case we start a new Flink job that consumes incremental data from the raw Iceberg table, processes it, and writes it to a cleaned Iceberg table. Other businesses may then need the data aggregated, so we start yet another incremental Flink job on the cleaned table and write the aggregated results to an aggregation table.
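The chain of incremental jobs described above can be sketched with a toy model. This is only an illustration, not Iceberg's or Flink's API: `ToyTable`, `read_incremental`, and `clean_job` are hypothetical names, and a "snapshot" here is simply the list of rows added by one commit. The point it shows is that each downstream job only needs to remember the last snapshot it consumed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Append-only table: every commit creates a snapshot holding the rows it added."""
    snapshots: list = field(default_factory=list)  # list of (snapshot_id, rows)

    def commit(self, rows):
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append((snapshot_id, list(rows)))
        return snapshot_id

    def read_incremental(self, after_snapshot_id):
        """Return (latest_snapshot_id, rows added after `after_snapshot_id`)."""
        latest, new_rows = after_snapshot_id, []
        for snapshot_id, rows in self.snapshots:
            if snapshot_id > after_snapshot_id:
                new_rows.extend(rows)
                latest = snapshot_id
        return latest, new_rows


def clean_job(raw, cleaned, last_seen):
    """One round of the cleaning job: read new raw rows, drop rows without a user."""
    latest, rows = raw.read_incremental(last_seen)
    good = [r for r in rows if r.get("user") is not None]
    if good:
        cleaned.commit(good)
    return latest  # the job's new read position


raw, cleaned = ToyTable(), ToyTable()
raw.commit([{"user": "a", "n": 1}, {"user": None, "n": 2}])
position = clean_job(raw, cleaned, 0)
raw.commit([{"user": "b", "n": 3}])
position = clean_job(raw, cleaned, position)
```

An aggregation job would follow the same pattern, reading incrementally from `cleaned` and keeping its own position.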

Some people will say that this scenario could also be built with Flink + Hive. It can, but data written to Hive serves data warehouse analysis more than incremental pulls. Generally speaking, Hive's incremental writes are partition-based, with intervals of 15 minutes or more, and long-running, high-frequency writes from Flink cause the number of partitions to balloon. Iceberg allows incremental writes every minute or even every 30 seconds, which greatly improves end-to-end data freshness. Upstream analysis jobs see newer data, and downstream incremental jobs read newer data.

The second classic scenario is using Flink + Iceberg to analyze binlogs from relational databases such as MySQL. On the one hand, Apache Flink natively supports parsing CDC data. After binlog data is pulled through the Ververica Flink CDC connector, it is automatically converted into the four kinds of messages that the Flink runtime recognizes (INSERT, DELETE, UPDATE_BEFORE, and UPDATE_AFTER), which users can then process further in real time.
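To make the four message kinds concrete, here is a minimal sketch of how a changelog built from them can be folded into the current state of a table. It is plain Python rather than Flink code; `apply_changelog` and the `id` key column are assumptions made for the example.

```python
def apply_changelog(events, key="id"):
    """Fold (kind, row) change events into a dict keyed by the primary key."""
    table = {}
    for kind, row in events:
        if kind in ("INSERT", "UPDATE_AFTER"):
            # New row, or the new image of an updated row.
            table[row[key]] = row
        elif kind in ("DELETE", "UPDATE_BEFORE"):
            # Deleted row, or the old image of an updated row: retract it.
            table.pop(row[key], None)
        else:
            raise ValueError(f"unknown change kind: {kind}")
    return table


events = [
    ("INSERT", {"id": 1, "name": "a"}),
    ("UPDATE_BEFORE", {"id": 1, "name": "a"}),
    ("UPDATE_AFTER", {"id": 1, "name": "b"}),
    ("INSERT", {"id": 2, "name": "c"}),
    ("DELETE", {"id": 2, "name": "c"}),
]
view = apply_changelog(events)
```

After the events above, `view` contains only row 1 with its updated name, which is what a query on the source MySQL table would return.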

On the other hand, Apache Iceberg has a solid implementation of equality delete: the user specifies the records to be deleted and writes them directly to the Apache Iceberg table, and the matching rows are removed. This gives the data lake streaming deletes. In a future version of Iceberg, users will not need to design any extra business fields; a few lines of code will be enough to stream a binlog into Apache Iceberg (a pull request in the community already provides a prototype for writing CDC data from Flink).
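The idea behind equality delete can be sketched as follows. This is a simplified model under stated assumptions, not Iceberg's reader: data files and delete records each carry a sequence number, and a delete removes matching rows only from data written before it. The function `scan` and the tuple layout are hypothetical.

```python
def scan(data_files, equality_deletes, key_columns):
    """Return live rows.

    data_files:       list of (sequence_number, rows)
    equality_deletes: list of (sequence_number, key_values)
    A row is dropped when a later delete matches it on all key columns.
    """
    live = []
    for data_seq, rows in data_files:
        for row in rows:
            deleted = any(
                delete_seq > data_seq
                and all(row[c] == keys[c] for c in key_columns)
                for delete_seq, keys in equality_deletes
            )
            if not deleted:
                live.append(row)
    return live


data_files = [
    (1, [{"id": 1, "v": "old"}, {"id": 2, "v": "keep"}]),
    (3, [{"id": 1, "v": "new"}]),  # written after the delete, so it survives
]
equality_deletes = [(2, {"id": 1})]
live_rows = scan(data_files, equality_deletes, key_columns=["id"])
```

Writing a delete record followed by a new row for the same key is how an update can be expressed without rewriting the old data file.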

In addition, once CDC data lands in Iceberg, we will also integrate the common computing engines, such as Presto, Spark, and Hive, so that they can all read the latest data in the Iceberg table in real time.

The third classic scenario is unifying stream and batch processing for near-real-time workloads. The commonly used lambda architecture has a real-time path and an offline path. The real-time path is generally built from Flink, Kafka, and HBase, while the offline path is generally built from Parquet, Spark, and similar components. Many computing and storage components are involved, so system maintenance and business development are both expensive. Many scenarios do not have strict real-time requirements and can tolerate minute-level latency; we call these near-real-time scenarios. So, can we use Flink + Iceberg to optimize the commonly used lambda architecture?

We can use Flink + Iceberg to optimize the whole architecture as shown in the figure. Real-time data is written to an Iceberg table through Flink. The near-real-time path can still compute over incremental data with Flink. The offline path can read a snapshot with a Flink batch job for global analysis and produce results that users in different scenarios can read and analyze. After this change, the computing engine is unified on Flink and the storage component on Iceberg, which greatly reduces the maintenance and development cost of the whole system.

The fourth scenario uses Iceberg's full data plus Kafka's incremental data to bootstrap a new Flink job. Suppose our existing streaming jobs are already running in production. One day a business team comes along with a new computing scenario that needs a new Flink job: it has to run through the last year of historical data and then continue with the incremental Kafka data that is still being produced. What should we do?

We can still use the common lambda architecture. The offline path writes to the data lake continuously through Kafka -> Flink -> Iceberg. Because Kafka storage is expensive, Kafka keeps only the most recent data, for example the last seven days. Iceberg storage is cheap, so it can keep the full history (divided into multiple data ranges by checkpoint). When a new Flink job starts, it first pulls the data from Iceberg and, once it has caught up, switches over smoothly to the Kafka data.
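The handover from Iceberg to Kafka can be sketched as below. This is a toy model, not a Flink source: it assumes every record carries its original Kafka offset and that the lake already contains everything below `switch_offset`. The name `bootstrap_stream` is made up for the example.

```python
def bootstrap_stream(lake_records, kafka_records, switch_offset):
    """Yield history from the lake, then Kafka records from `switch_offset` on.

    Both inputs are iterables of (offset, value). Each record is emitted once:
    offsets below the switch point come from the lake, the rest from Kafka.
    """
    for offset, value in lake_records:
        if offset < switch_offset:
            yield value
    for offset, value in kafka_records:
        if offset >= switch_offset:
            yield value


# The lake holds the full history (offsets 0..4); Kafka only retains 3..6.
lake = [(i, f"event-{i}") for i in range(5)]
kafka = [(i, f"event-{i}") for i in range(3, 7)]
replayed = list(bootstrap_stream(lake, kafka, switch_offset=5))
```

The overlap between the two sources (offsets 3 and 4 here) is what makes a clean switch possible: as long as the switch point falls inside the range that both sides hold, no record is lost or duplicated.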

The fifth scenario is somewhat similar to the fourth. In the lambda architecture, events on the real-time path may be lost or arrive out of order, so the results of stream computing are not always fully accurate. In that case the full historical data is usually needed to correct the real-time results. Iceberg fills this role well, because it manages historical data very cost-effectively.

Why Apache Iceberg

Back to the question left open in the previous section: among the many open source data lake projects, why did Flink choose Apache Iceberg at the time?

We investigated Delta, Hudi, and Iceberg in detail and wrote a research report. We found that Delta and Hudi are too tightly bound to Spark's code paths, especially the write path. After all, both projects were designed more or less with Spark as their default computing engine. Apache Iceberg's direction is very clear: its goal is to be a universal table format. It therefore cleanly decouples the computing engine from the underlying storage system, which makes it easy to plug in different computing engines and file formats. You could say it gets the table format layer of the data lake architecture right. We also think it is more likely to become the open source de facto standard for the table format layer.

On the other hand, Apache Iceberg is evolving toward a data lake storage layer that unifies stream and batch. The design of manifests and snapshots effectively isolates the changes made by different transactions, which is very convenient for both batch processing and incremental computation. Apache Flink, as we know, is already a unified stream and batch computing engine. The long-term plans of the two projects match very well, and in the future they will work together to build a unified stream and batch data lake architecture.

Finally, we also found that the community behind the Apache Iceberg project is very strong. Abroad, Netflix, Apple, LinkedIn, Adobe, and other companies run PB-scale production data on Apache Iceberg. In China, Tencent and other large companies also run huge volumes of data on Apache Iceberg, and their largest business writes tens of terabytes of incremental data to Apache Iceberg every day. The community members are also very experienced and diverse, including 7 Apache PMC members from other projects and 1 VP. This shows in the code and design reviews, which are very strict: it is common for a slightly larger PR to collect 100+ comments. In my opinion, all of this makes the design and code quality of Apache Iceberg relatively high.

Based on these considerations, Apache Flink finally chose Apache Iceberg as the first data lake project to integrate with.

How to stream data into the lake with Flink + Iceberg

As of Apache Iceberg 0.10.0, we have implemented both streaming and batch writes from Flink into the lake, and Flink batch jobs can also query data in the Iceberg data lake. For details on reading and writing Apache Iceberg tables with Flink, please refer to the Apache Iceberg community's usage documentation, which is not repeated here.…

Here is a brief introduction to the design of the Flink Iceberg sink. Iceberg commits transactions with optimistic locking: when two writers commit changes to Iceberg at the same time, the one that loses the race keeps retrying, and after the first writer's commit succeeds it re-reads the metadata and commits again. Given this, committing transactions from many parallel operator instances is a bad idea, because it easily produces a large number of transaction conflicts and retries.
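The optimistic commit loop described above can be sketched like this. It is a toy model under stated assumptions, not Iceberg's implementation: `ToyCatalog` stands in for the table metadata pointer, and `compare_and_swap` for its atomic update. The `before_swap` hook exists only so that the example can inject a rival writer.

```python
class ToyCatalog:
    """Holds the current table version and file list; updates are compare-and-swap."""

    def __init__(self):
        self.version = 0
        self.files = []

    def compare_and_swap(self, expected_version, new_files):
        if self.version != expected_version:
            return False  # someone else committed first
        self.files = new_files
        self.version += 1
        return True


def commit_with_retry(catalog, added_files, max_attempts=5, before_swap=None):
    """Re-read metadata and retry until the commit lands; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        base_version = catalog.version      # re-read the metadata
        base_files = list(catalog.files)
        if before_swap is not None:
            before_swap(attempt)            # test hook: simulate a concurrent writer
        if catalog.compare_and_swap(base_version, base_files + added_files):
            return attempt
    raise RuntimeError("commit failed after retries")


catalog = ToyCatalog()


def rival(attempt):
    # A second writer sneaks in a commit during our first attempt only.
    if attempt == 1:
        catalog.compare_and_swap(catalog.version, catalog.files + ["rival.parquet"])


attempts = commit_with_retry(catalog, ["mine.parquet"], before_swap=rival)
```

With a single rival the commit needs two attempts. With many parallel committers, every successful commit forces all the others to retry, which is why a single committer is the more natural design.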

Therefore, we split the Flink write path into two operators. The first, IcebergStreamWriter, writes records into the corresponding Avro, Parquet, or ORC files, produces an Iceberg DataFile for each file, and sends it to the downstream operator. The second, IcebergFilesCommitter, collects all the DataFiles when a checkpoint arrives and commits a transaction to Apache Iceberg, which completes the write for that checkpoint.

With the design of the Flink sink operators understood, the next important question is: how should the state of the two operators be designed correctly?

First, IcebergStreamWriter is relatively simple. Its main task is to convert records into DataFiles, so there is no complex state to design. IcebergFilesCommitter is a bit more complicated. It maintains a list of DataFiles for each checkpoint ID, that is, a Map<Long, List<DataFile>>. This way, even if the transaction commit for some checkpoint in the middle fails, its DataFiles are still kept in the state, and the data can be committed to the Iceberg table by a later checkpoint.
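The committer state described above can be sketched as a small class. This is a simplified model, not the real operator: `ToyCommitter` and `commit_fn` are hypothetical, the method names are only loosely modeled on Flink's `snapshotState` and `notifyCheckpointComplete` callbacks, and each pending checkpoint is committed separately here for clarity.

```python
class ToyCommitter:
    """Keeps data files per checkpoint id until they are committed successfully."""

    def __init__(self, commit_fn):
        self.commit_fn = commit_fn  # returns True when the table commit succeeds
        self.buffer = []            # files received since the last checkpoint
        self.pending = {}           # checkpoint_id -> list of data files

    def receive(self, data_file):
        self.buffer.append(data_file)

    def snapshot_state(self, checkpoint_id):
        # Files written before this checkpoint barrier belong to this checkpoint.
        self.pending[checkpoint_id] = self.buffer
        self.buffer = []

    def notify_checkpoint_complete(self, checkpoint_id):
        # Commit every pending checkpoint up to this one, oldest first.
        for cp in sorted(c for c in self.pending if c <= checkpoint_id):
            if not self.commit_fn(self.pending[cp]):
                break               # keep it in state; a later checkpoint retries
            del self.pending[cp]


committed = []
outcomes = iter([False, True, True])  # the very first commit attempt fails


def flaky_commit(files):
    ok = next(outcomes)
    if ok:
        committed.extend(files)
    return ok


committer = ToyCommitter(flaky_commit)
committer.receive("f1.parquet")
committer.snapshot_state(1)
committer.notify_checkpoint_complete(1)   # fails, f1 stays pending
pending_after_failure = dict(committer.pending)

committer.receive("f2.parquet")
committer.snapshot_state(2)
committer.notify_checkpoint_complete(2)   # commits checkpoint 1, then 2
```

Because the map is part of the operator state, the files from the failed checkpoint are not lost; they are simply carried forward until a commit succeeds.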

Community future planning

The release of Apache Iceberg 0.10.0 marks the start of the Flink and Iceberg integration. For the upcoming Apache Iceberg 0.11.0 and 0.12.0 releases, we have planned more advanced functions and features.

Apache Iceberg 0.11.0 mainly addresses two problems:

The first is merging small files. Apache Iceberg 0.10.0 already supports merging small files periodically with a Flink batch job, but this is still fairly rudimentary. In 0.11.0 we will design automatic small file merging: after a Flink checkpoint arrives and the Apache Iceberg transaction is committed, a dedicated operator handles the merging of small files.
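As a rough illustration of what such a merging step has to decide, the sketch below groups small files into batches that are rewritten together. It is a simple greedy plan, not the algorithm Iceberg uses; `plan_compaction`, the file names, and the 128 MB target are assumptions for the example.

```python
def plan_compaction(file_sizes, target_size):
    """Group small files so that each group's total size stays within target_size.

    file_sizes: dict of file name -> size (same unit as target_size).
    Files already at or above the target are left alone, and so is a
    leftover group of a single file, since rewriting it would gain nothing.
    """
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        if size >= target_size:
            continue
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    groups.append(current)
    return [g for g in groups if len(g) > 1]


sizes_mb = {"a": 10, "b": 20, "c": 128, "d": 60, "e": 50, "f": 5}
plan = plan_compaction(sizes_mb, target_size=128)
```

Each group in `plan` would be read and rewritten as one larger file, and the swap of old files for the new one would be committed as a single transaction.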

The second is developing the Flink streaming reader. We have already done some proof-of-concept work in a private repository and will contribute it to the Apache Iceberg community in the future.

Version 0.12.0 mainly addresses row-level delete. As mentioned earlier, we have implemented the full path for Flink upserts into the data lake in PR 1663. Once the community reaches agreement, the feature will gradually be merged into the community version. Users will then be able to write and analyze CDC data in real time through Flink, and will also be able to write Flink's aggregation results to Apache Iceberg easily.

About the author:

Hu Zheng (Ziyi), a technical expert at Alibaba, is currently responsible mainly for the design and development of the Flink data lake solution. He is a long-time active contributor to the Apache Iceberg and Apache Flink projects, and the author of HBase Principles and Practice.