Practice of Flink integrating iceberg in Tongcheng Yilong

Time: 2021-04-17

Introduction: This article is shared by Zhang Jun, a big data development engineer at Tongcheng Yilong. It mainly introduces the production practice of integrating Flink with Iceberg at Tongcheng Yilong.

The contents include:

  1. Background and pain points
  2. The landing of Flink + iceberg
  3. Iceberg Optimization Practice
  4. Follow-up work
  5. Benefits and summary

1、 Background and pain points

Business background

Tongcheng Yilong is an online travel service platform that provides air tickets, accommodation, transportation and other services. My department is part of the company's R&D department, and its main responsibility is to provide basic services to the company's other business units. Our big data system mainly handles the data statistics and analysis work within the department. The data sources include gateway log data, server monitoring data, Kubernetes container logs, app management logs, MySQL binlog, and so on. Our main big data work is to build real-time reports on top of these logs, provide report display and real-time query services based on Presto, and develop real-time and batch tasks with Flink to give business teams accurate and timely data support.

Original architecture scheme

Because all of our raw data is stored in Kafka, the original architecture was: Flink tasks consume the data from Kafka, process it with Flink SQL or Flink jar jobs, and write it to Hive in real time. Most of the tasks are Flink SQL tasks, because SQL tasks are much simpler to develop than code and are easier to maintain and understand, so we write them in SQL wherever possible.
The platform for submitting Flink jobs is Zeppelin: submitting Flink SQL tasks is Zeppelin's built-in capability, while jar tasks are submitted through a Zeppelin plug-in I wrote based on application mode.
For the data landed in Hive, we use the open-source reporting system Metabase (with Presto underneath) to provide real-time report display, scheduled e-mail reports, and custom SQL query services. Because the business has high real-time requirements and we want data to be visible as soon as possible, the checkpoint interval of many of our Flink streaming tasks is set to 1 minute, and the data format is ORC.

Pain point

Because we use the columnar ORC format, which cannot be appended to the way row-oriented formats can, we inevitably run into a very common and difficult problem in the big data field: the HDFS small-file problem.

Our initial solution to the small-file problem was to write a compaction tool that merges files on a schedule. Our Hive partitions are generally day-level, so the tool works like this: a scheduled task starts every morning to compact yesterday's data. It first writes yesterday's data into a temporary folder; after compaction it compares the record count against the original data, and once the counts match it replaces the original data with the compacted data. However, since transactions cannot be guaranteed, this caused many problems:

  • Due to late-arriving data, records may still be written into yesterday's Hive partition during compaction, so the record-count check fails and the small-file merge fails.
  • There is no transactional guarantee when replacing the old data. If new data is written into the old partition during the replacement, the new data is overwritten and lost.
  • Without transaction support, the current partition cannot be compacted in real time; only earlier partitions can be merged and compacted. The latest partition still suffers from small files, so query performance on the latest data cannot be improved.

2、 The landing of Flink + iceberg

Iceberg technology research

Therefore, given the HDFS small-file and slow-query problems described above, and considering our current situation, I investigated the data lake technologies currently on the market: Delta Lake, Apache Iceberg and Apache Hudi, weighing the features each framework supports today as well as its community roadmap. In the end we chose Iceberg, for the following reasons:

■ Iceberg deeply integrates with Flink

As mentioned earlier, most of our tasks are Flink tasks, including batch and streaming jobs. Among the three data lake frameworks, Iceberg currently has the most complete Flink integration. Replacing Hive with Iceberg has a very small migration cost and is almost transparent to users.
For example, our original SQL looks like this:

INSERT INTO hive_catalog.db.hive_table SELECT * FROM kafka_table

After migrating to iceberg, you only need to modify the catalog.

INSERT INTO iceberg_catalog.db.iceberg_table SELECT * FROM kafka_table

Presto query is similar to this, just modify the catalog.

■ Iceberg's design architecture makes queries faster


In Iceberg's design architecture, the manifest files store partition information and statistics about each data file (max/min values, etc.), so a query over a large partitioned table can locate exactly the data it needs instead of listing the entire HDFS folder the way Hive does; the complexity drops from O(n) to O(1). In the talk by Iceberg PMC chair Ryan Blue, the execution time of a task that hit the filter dropped from 61.5 hours to 22 minutes.

■ Using Flink SQL to write CDC data to Iceberg

Flink CDC provides a way to read the MySQL binlog directly. Compared with the previous approach of using Canal to read the binlog, writing it to an intermediate store and then consuming that data in a separate job, we maintain one component less and the link is shorter, which saves maintenance cost and reduces the chance of errors. It can also seamlessly connect the import of full data with the import of incremental data, so using Flink SQL to import MySQL binlog data into Iceberg (MySQL -> Iceberg) is very valuable to us.
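As a rough illustration (not our exact production job), a CDC ingestion pipeline of this kind could be expressed in Flink SQL as follows. The mysql-cdc connector options come from the flink-cdc-connectors project, the host, credentials and table names are placeholders, and whether upserts land correctly depends on the Iceberg and Flink versions in use.

-- Hypothetical sketch: expose the MySQL binlog as a Flink source table via flink-cdc-connectors
CREATE TABLE mysql_orders (
  id BIGINT,
  order_status STRING,
  update_time TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'user',
  'password' = 'password',
  'database-name' = 'db',
  'table-name' = 'orders'
);

-- Write the change stream into an Iceberg table registered in our Hive catalog
INSERT INTO iceberg_catalog.db.orders SELECT * FROM mysql_orders;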

In addition, regarding our original requirement of compacting small files: although Iceberg cannot compact automatically yet, it provides a batch compaction task, which already meets our needs.

■ Migrating Hive tables to Iceberg tables

Migration preparation

At present, all of our data is stored in Hive tables. After verifying Iceberg, we decided to migrate the Hive data to Iceberg, so I wrote a tool that reuses Hive's existing data files and creates a new Iceberg table with the corresponding metadata generated for them. However, during testing we found a problem with this approach: if Iceberg and Hive share the same data files, the compaction program keeps compacting the Iceberg table's small files, and since the old files are not deleted immediately after compaction, the Hive table ends up seeing duplicated data. Therefore, we adopted a dual-write strategy: the program that originally writes to Hive is left untouched, and a new program is started to write to Iceberg. This lets us observe the Iceberg table for a period of time and compare it with the original Hive data to verify the correctness of the program.

After a period of observation, with several billion records per day and terabyte-level data after compression, the Hive table and the Iceberg table did not differ by a single record. So once the final comparison showed no problems, we stopped writing to the Hive table and switched to the new Iceberg table.

Migration tools

I turned this Hive-to-Iceberg migration tool into an Iceberg action based on a Flink batch job and submitted it to the community, but it has not been merged yet: https://github.com/apache/iceberg/pull/2217. The idea is to keep Hive's original data files in place, create a new Iceberg table, and then generate the corresponding metadata for the new table. If you need it, you can take a look first.

In addition, the Iceberg community also has a tool to migrate existing data into an existing Iceberg table, similar to Hive's load data inpath … into table, implemented as a Spark stored procedure. You can also keep an eye on it: https://github.com/apache/iceberg/pull/2210

3、 Iceberg Optimization Practice

Compress small files

At present, small-file compaction runs as a separate batch task. Iceberg provides a Spark version of the action, but I found some problems during functional testing, and I am not very familiar with Spark, so troubleshooting would be hard if something went wrong. So I implemented a Flink version based on the Spark version, fixed some bugs, and optimized some of the functionality.

Since our Iceberg metadata is stored in Hive (we use HiveCatalog), the compaction program's logic is to find all Iceberg tables in Hive and compact them one by one. There is no filter condition for compaction: whether a table is partitioned or not, the whole table is compacted. This is to handle Flink tasks that use event time: if data arrives late, it is written into earlier partitions, and if we only compacted the current day's partition instead of the whole table, the new data written into other days would never be compacted.

The reason we do not simply run compaction on a fixed schedule is this: suppose a table is compacted every five minutes; if one compaction run takes longer than five minutes and has not yet committed a new snapshot, the next scheduled run would compact the data of the still-unfinished run again. Compacting each table in turn therefore guarantees that at any given time only one compaction task is running on a table.

Code example reference:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Actions.forTable(env, table)
    .rewriteDataFiles()
    // .maxParallelism(parallelism)
    // .filter(Expressions.equal("day", day))
    // .targetSizeInBytes(targetSizeInBytes)
    .execute();

At present, the system runs stably and has completed tens of thousands of compaction runs.


Note:
However, the newly released Iceberg 0.11 has a known bug: when a file is already larger than the target compaction size (targetSizeInBytes) before compaction, compaction can cause data loss. I actually hit this problem when I first tested small-file compaction and raised a PR; my strategy was that data files larger than the target size should not participate in compaction, but that PR was not merged into 0.11. Later another community member found the same problem and submitted a PR (https://github.com/apache/iceberg/pull/2196) whose strategy is to split large files into pieces of the target size; it has been merged into master and will be released in the next bugfix version, 0.11.1.

Query optimization

Scheduled batch processing tasks

Currently, for batch tasks run on a schedule, Flink's SQL client is not as complete as Hive's; for example, there is no equivalent of hive -f to execute a SQL file. In addition, different tasks need different resources, parallelism and so on.

So I wrapped a Flink program that is called to read the SQL in a specified file and submit the batch task, with the task's resources and parallelism controlled on the command line.

/home/flink/bin/flink run -p 10 -m yarn-cluster /home/work/iceberg-scheduler.jar my.sql
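For illustration, the SQL file handed to this wrapper could contain a batch statement like the one below; the table and column names are made up, and INSERT OVERWRITE is only available in Flink batch mode.

-- Hypothetical content of my.sql: rebuild one day's report partition as a batch job
INSERT OVERWRITE iceberg_catalog.db.gateway_report
SELECT day, api, COUNT(*) AS pv
FROM iceberg_catalog.db.gateway_log
WHERE day = '2021-04-01'
GROUP BY day, api;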

■ optimization

For batch queries, I did some optimization work such as limit push-down, filter push-down and source parallelism inference, which can greatly improve query speed. These optimizations have been contributed back to the community and released in Iceberg 0.11.
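As a simple illustration of what these optimizations help with (the table and column names are made up), a query like the following only needs to scan the files of the matching partition and can stop as soon as the limit is satisfied, instead of reading the whole table:

-- The partition filter and the LIMIT can both be pushed down into the Iceberg source
SELECT *
FROM iceberg_catalog.db.gateway_log
WHERE day = '2021-04-01'
LIMIT 100;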

Operations and maintenance

■ clean up orphan files

  1. Scheduled task deletion

While using Iceberg, there are cases where I submit a Flink task and then stop it for various reasons before Iceberg has committed the corresponding snapshot; in addition, some program failures caused by exceptions leave behind data files that are not referenced in Iceberg's metadata. These orphan files are unreachable and useless to Iceberg, so we need to clean them up, much like JVM garbage collection.

At present, Iceberg provides a Spark version of the action to deal with these useless files. We adopt the same strategy as for small-file compaction: obtain all Iceberg tables in Hive and run a scheduled task every hour to delete the useless files.

SparkSession spark = ......
Actions.forTable(spark, table)
    .removeOrphanFiles()
    // .deleteWith(...)
    .execute();
  2. Pitfalls

While running this program, we hit a problem where normal data files were deleted. After investigation: the snapshot retention was set to one hour, and the cleanup threshold of the orphan-file cleaner had also been set to one hour. From the logs we found that the cleaner had deleted normal data, and after checking the code we concluded that with the two intervals set to the same short value, other programs were still reading the snapshots that were about to expire while orphan files were being cleaned up, which led to normal data being deleted. In the end we changed the cleaner's threshold back to the default of three days, and data files have not been deleted by mistake since.
Of course, to be on the safe side, we could also replace direct deletion with moving the files to a backup folder, verify that nothing is wrong, and then delete them manually.

■ Snapshot expiration processing

Our snapshot expiration policy runs together with the small-file compaction batch task: after compacting a table's small files, we expire its old snapshots. The current retention time is one hour, because for some large tables with many partitions and a short checkpoint interval, keeping snapshots too long would still leave too many small files. We do not need to query historical snapshots for now, so the snapshot retention time is set to one hour.

long olderThanTimestamp = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(1);
table.expireSnapshots()
    // .retainLast(20)
    .expireOlderThan(olderThanTimestamp)
    .commit();

■ Data management

After data has been written, when you want to know how many data files the corresponding snapshot contains, you cannot tell directly which files are useful and which are useless, so we need corresponding management tools. Flink's tooling here is not yet very mature, so we use the tools provided by Spark 3 to inspect the tables.
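For example, Spark 3 can query Iceberg's metadata tables to inspect snapshots and data files. A minimal sketch, assuming an Iceberg catalog named iceberg_catalog is configured in Spark:

-- List the snapshots of a table
SELECT snapshot_id, committed_at, operation FROM iceberg_catalog.db.iceberg_table.snapshots;

-- List the data files referenced by the current snapshot
SELECT file_path, record_count, file_size_in_bytes FROM iceberg_catalog.db.iceberg_table.files;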

  1. DDL

At present, we perform these operations through the Flink SQL client. Other related DDL operations can be done with Spark: https://iceberg.apache.org/spark/#ddl-commands
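As a sketch of the kind of DDL we run from the Flink SQL client (the catalog name, metastore URI and warehouse path below are placeholders):

-- Register an Iceberg catalog backed by the Hive metastore
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',
  'uri' = 'thrift://metastore-host:9083',
  'warehouse' = 'hdfs://nameservice1/warehouse/iceberg'
);

-- Create a day-partitioned Iceberg table
CREATE TABLE iceberg_catalog.db.iceberg_table (
  id BIGINT,
  data STRING,
  day STRING
) PARTITIONED BY (day);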

  2. DML

Some related data operations, such as deleting data, can currently be done through Spark; Presto only supports partition-level deletes.

  3. show partitions & show create table

When working with Hive there are some very common operations, such as show partitions and show create table, which Flink does not support yet, so operating Iceberg is inconvenient in this respect. We have made our own modifications on top of Flink 1.12, but have not yet submitted them; we will contribute them to the Flink and Iceberg communities when we have time.

4、 Follow-up work

  • Flink SQL connects CDC data to iceberg

In our internal version, I have already verified that Flink SQL can write CDC data (such as the MySQL binlog) into Iceberg. Some work is still needed to make this available in the community version, and I have submitted several related PRs to push it forward.

  • Delete and update using SQL

For copy-on-write tables, we can use Spark SQL to delete and update rows. For the specific syntax, refer to the test classes in the source code:

org.apache.iceberg.spark.extensions.TestDelete & org.apache.iceberg.spark.extensions.TestUpdate. I have verified these functions in the test environment, but have not yet had time to roll them out to production.
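For reference, the syntax exercised by those tests looks roughly like the following Spark SQL; the table and column names are placeholders, and the Iceberg Spark SQL extensions have to be enabled in the Spark session:

-- Row-level delete on a copy-on-write table
DELETE FROM iceberg_catalog.db.orders WHERE order_status = 'CANCELLED';

-- Row-level update
UPDATE iceberg_catalog.db.orders SET order_status = 'FINISHED' WHERE id = 1;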

  • Streaming read using Flink SQL

In our work there are scenarios like this: because the data volume is large, Kafka only retains the data for a short time, and if a program was written incorrectly or fails for other reasons, there is no way to go back and consume data from before the retention window.
Once Iceberg streaming reads are introduced, these problems can be solved, because Iceberg keeps all the data. There is of course a premise: the data must not have strict latency requirements, such as second-level freshness, because at present the transactions Flink writes to Iceberg are committed at the Flink checkpoint interval.
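A streaming read of this kind can be written in Flink SQL roughly as below; the table hint options follow Iceberg's Flink streaming-read support, the interval is illustrative, and the job has to run in streaming execution mode:

-- Continuously monitor the Iceberg table and emit data from newly committed snapshots
SELECT *
FROM iceberg_catalog.db.gateway_log
/*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */;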

5、 Benefits and summary

After about a quarter of research, testing, optimization and bug fixing around Iceberg, we have migrated the existing Hive tables to Iceberg, which perfectly solved all of the original pain points. The system is now running stably and, compared with Hive, has brought many benefits.

  • Reduced resources for Flink writes

For example, with the default configuration, a Flink task that read Kafka and wrote to Hive originally needed a parallelism of 60 to avoid a Kafka backlog; after switching to writing Iceberg, a parallelism of 20 is enough.

  • Faster query speed

As mentioned earlier, an Iceberg query does not list the entire folder to obtain partition data the way Hive does; it first reads the relevant information from the manifest files. Query performance has improved significantly, and the query time of some large reports has dropped from 50 seconds to 30 seconds.

  • Concurrent reads and writes

Because Iceberg supports transactions, we can read and write a table concurrently: Flink streams data into the lake in real time while the small-file compaction job, the expired-snapshot/file cleanup job and the orphan-file cleanup job run at the same time. This lets us provide data more promptly, with minute-level latency, and greatly speeds up queries on the latest partition data. Iceberg's ACID properties guarantee the correctness of the data.

  • time travel

We can query the data as it was at an earlier point in time.

To sum up, we can use Flink SQL to read and write Iceberg in both batch and streaming mode, compact small files in real time, use Spark SQL for some delete and update work and some DDL operations, and later use Flink SQL to write CDC data into Iceberg. At present, I have contributed all of the Iceberg optimizations and bug fixes mentioned above to the community. Given the author's limited level, mistakes are sometimes inevitable; corrections are welcome.

Author’s introduction:
Zhang Jun, Tongcheng Yilong big data development engineer
