Autohome's Lakehouse Architecture Practice Based on Flink + Iceberg

Time: 2021-10-19

Introduction: This article is compiled from the talk "Lakehouse Architecture Practice Based on Flink + Iceberg" shared by Di Xingxing, head of the real-time computing platform at Autohome, at the Shanghai meetup on April 17.

Main contents:

1、 Background of the data warehouse architecture upgrade

2、 Lakehouse architecture practice based on Iceberg

3、 Summary and benefits

4、 Follow-up planning

GitHub address:
https://github.com/apache/flink
Welcome to give Flink a like and a star~

1、 Background of the data warehouse architecture upgrade

1. Pain points of the Hive-based data warehouse

The original data warehouse was built entirely on Hive, which has three main pain points:

Pain point 1: no ACID support

1) Upsert scenarios are not supported;

2) Row-level deletes are not supported, so data correction is costly.

Pain point 2: timeliness is hard to improve

1) Data can hardly be made visible in near real time;

2) Incremental reads are not supported, so stream and batch cannot be unified at the storage level;

3) Analysis scenarios requiring minute-level latency cannot be supported.

Pain point 3: table evolution

1) Hive uses a write-time schema, with poor support for schema changes;

2) Changing the partition spec is not well supported.

2. Key features of Iceberg

Iceberg has four key features: ACID semantics, an incremental snapshot mechanism, an open table format, and stream/batch interface support.

  • ACID semantics

    • Incomplete commits are never read;
    • Concurrent commits are supported via optimistic locking;
    • Row-level deletes enable upserts.
  • Incremental snapshot mechanism

    • Data becomes visible after commit (minute-level latency);
    • Historical snapshots are traceable.
  • Open table format

    • Data formats: Parquet, ORC, Avro
    • Compute engines: Spark, Flink, Hive, Trino / Presto
  • Stream/batch interface support

    • Supports streaming and batch writes;
    • Supports streaming and batch reads.

2、 Lakehouse architecture practice based on Iceberg

The point of the lakehouse is that users do not need to see a separate lake and warehouse: the data has an interconnected metadata format, can flow freely, and can plug into the diverse computing ecosystem above it.

——Jia Yangqing (Senior Researcher, Alibaba Cloud Computing Platform)

1. Append data ingestion link

[Figure: log data ingestion link]

The figure above shows the ingestion link for log data, which includes client-side logs and server-side logs. These logs are pushed to Kafka in real time, written into Iceberg by Flink jobs, and ultimately stored on HDFS.

2. Connecting Flink SQL to the lake

Our Flink SQL ingestion link is built on "Flink 1.11 + Iceberg 0.11". To integrate the Iceberg catalog, we mainly did the following:

1) The Meta Server added support for the Iceberg catalog;

2) Iceberg catalog support was added to the SQL SDK.

On this basis, the platform exposes Iceberg table management so that users can create SQL tables on the platform.
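A minimal sketch of what this integration looks like from the user side, assuming Flink 1.11 + Iceberg 0.11 and a Hive-backed catalog (the URI and warehouse path are placeholders):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IcebergCatalogBootstrap {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Register a Hive-backed Iceberg catalog; uri/warehouse are placeholders.
        tEnv.executeSql(
                "CREATE CATALOG iceberg_catalog WITH ("
                        + " 'type' = 'iceberg',"
                        + " 'catalog-type' = 'hive',"
                        + " 'uri' = 'thrift://metastore-host:9083',"
                        + " 'warehouse' = 'hdfs:///user/hive/warehouse'"
                        + ")");
        tEnv.executeSql("USE CATALOG iceberg_catalog");
    }
}
```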

3. Entering the lake – supporting proxy users

The second step is internal practice: integrating with the existing budget and permission systems.

In the past, real-time jobs on the platform all ran as the default flink user. Earlier storage did not involve HDFS, so this caused no problems, and budget attribution was never considered.

Writing to Iceberg now raises new issues. For example, if the data warehouse team has its own data mart, the data should be written to their directory and charged against their budget, and permissions should be aligned with the offline team's account system.

[Figure: proxy user function]

As shown above, this is mainly a proxy-user function on the platform: users can specify which account the data is written to Iceberg as. The implementation consists of the following three steps.

  • Add a table-level configuration: 'iceberg.user.proxy' = 'targetuser'

    1) Enable superuser

    2) Team account authentication

    [Figure: team account authentication]

  • Enable the proxy user when accessing HDFS:

    [Figure: proxy user when accessing HDFS]

  • Specify the proxy user when accessing Hive Metastore:

    1) Refer to Spark's related implementation:

    org.apache.spark.deploy.security.HiveDelegationTokenProvider

    2) Dynamically proxy HiveMetaStoreClient so that Hive Metastore is accessed as the proxy user
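For illustration, a sketch of the HDFS side of the proxy mechanism using Hadoop's standard UserGroupInformation API (user names and paths are examples; the cluster must also whitelist the super user via hadoop.proxyuser.* settings):

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserSketch {
    public static void main(String[] args) throws Exception {
        // The platform's super user, e.g. "flink".
        UserGroupInformation realUser = UserGroupInformation.getLoginUser();
        // The team account from 'iceberg.user.proxy'.
        UserGroupInformation proxyUgi =
                UserGroupInformation.createProxyUser("targetuser", realUser);

        // All HDFS access inside doAs() runs as the proxy user, so files land
        // under the team's directory with the team's ownership.
        proxyUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
            FileSystem fs = FileSystem.get(new Configuration());
            fs.mkdirs(new Path("/warehouse/team_db/iceberg_table"));
            return null;
        });
    }
}
```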

4. Flink SQL ingestion example

DDL + DML

[Figure: DDL + DML example]
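The original example was attached as an image. Below is a hedged reconstruction of what such a DDL + DML pair could look like, reusing the tEnv from the catalog sketch above (table and field names are invented):

```java
// DDL: create an Iceberg table through the catalog registered above.
tEnv.executeSql(
        "CREATE TABLE iceberg_catalog.ods.user_log ("
                + " user_id STRING,"
                + " event STRING,"
                + " ts TIMESTAMP(3)"
                + ") PARTITIONED BY (event)");

// DML: continuously insert a Kafka-backed source table into the Iceberg table.
tEnv.executeSql(
        "INSERT INTO iceberg_catalog.ods.user_log "
                + "SELECT user_id, event, ts "
                + "FROM default_catalog.default_database.kafka_user_log");
```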

5. CDC data ingestion link

[Figure: CDC data ingestion link]

As shown above, our AutoDTS platform is responsible for real-time access to business database data. Data from these business databases is ingested into Kafka, and the platform also supports configuring distribution tasks, which distribute the data in Kafka to different storage engines; in this scenario, it is distributed to Iceberg.

6. Connecting Flink SQL CDC to the lake

The following are the changes we made on top of "Flink 1.11 + Iceberg 0.11" to support CDC ingestion:

  • Improved the Iceberg sink:

    In Flink 1.11 the sink was an AppendStreamTableSink, which cannot process CDC streams; we modified and adapted it.

  • Table management

    1) Support primary keys (PR #1978)

    2) Open the v2 format: 'iceberg.format.version' = '2'
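Putting the two points together, a CDC table definition could look roughly like the sketch below, again reusing the tEnv from the catalog sketch (names invented; 'iceberg.format.version' is the platform-level property quoted above, whereas the open-source table property is 'format-version'):

```java
tEnv.executeSql(
        "CREATE TABLE iceberg_catalog.ods.orders ("
                + " id BIGINT,"
                + " status STRING,"
                + " update_time TIMESTAMP(3),"
                + " PRIMARY KEY (id) NOT ENFORCED"  // primary key support (PR #1978)
                + ") WITH ("
                + " 'iceberg.format.version' = '2'" // open the v2 format
                + ")");
```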

7. CDC data ingestion

1. Bucket support

In upsert scenarios, you need to ensure that the same row of data is always written to the same bucket. How is this achieved?

Flink SQL syntax does not currently support declaring bucket partitions, so we declare buckets through configuration:

'partition.bucket.source' = 'id',  -- specify the bucket field

'partition.bucket.num' = '10',     -- specify the number of buckets
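A minimal sketch (all names assumed) of how these two properties can translate into routing: hash the configured field modulo the bucket count, so rows with the same id always land in the same bucket and hence the same writer:

```java
public final class BucketAssigner {
    private final int numBuckets; // from 'partition.bucket.num'

    public BucketAssigner(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    /** @param bucketKey value of the field named by 'partition.bucket.source' */
    public int assign(Object bucketKey) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        return Math.floorMod(bucketKey.hashCode(), numBuckets);
    }
}
```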

2. Copy-on-write sink

We went with copy-on-write because the community's original merge-on-read did not support merging small files, so we implemented a copy-on-write sink as a temporary measure. It has already been tested and used by business teams with good results.

[Figure: copy-on-write sink implementation]

The above is the copy-on-write implementation. It is actually similar to the original merge-on-read design: a StreamWriter writes with multiple parallelism, and a FileCommitter commits sequentially with single parallelism.

With copy-on-write, the number of buckets needs to be set appropriately for the table's data volume; no additional small-file compaction is needed.

  • StreamWriter writes with multiple parallelism in the snapshotState phase

    1) Add a buffer;

    2) Before writing, check that the last checkpoint has been committed successfully;

    3) Group and merge by bucket, writing bucket by bucket.

  • FileCommitter commits sequentially with single parallelism

    1) table.newOverwrite()

    2) flink.last.committed.checkpoint.id

    [Figure: FileCommitter implementation]
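A sketch of the FileCommitter's core using the Iceberg Java API (variable names and the "bucket" partition column are assumptions): each checkpoint's rewritten buckets are swapped in with one atomic overwrite, and the checkpoint id is stamped into the snapshot summary so a restarted job can check what was last committed:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.OverwriteFiles;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;

class CopyOnWriteCommitter {
    /** Commit the files rewritten for one bucket during this checkpoint. */
    void commit(Table table, long checkpointId, int bucketId, Iterable<DataFile> rewritten) {
        OverwriteFiles overwrite = table.newOverwrite()
                // replace the previous contents of this bucket partition
                .overwriteByRowFilter(Expressions.equal("bucket", bucketId));
        for (DataFile file : rewritten) {
            overwrite.addFile(file);
        }
        // recorded in the snapshot summary; checked again on restore
        overwrite.set("flink.last.committed.checkpoint.id", String.valueOf(checkpointId));
        overwrite.commit();
    }
}
```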

8. Example – configuring CDC data ingestion

[Figure: configuring a distribution task on the AutoDTS platform]

As shown in the figure above, in actual use business teams can create or configure distribution tasks on the DTS platform.

Select "Iceberg table" as the instance type, then select the target database and table to specify which table's data to synchronize to Iceberg, and configure the field mapping between the source table and the target table. Once configured, the distribution task can be started. On startup, a real-time job is submitted to the Flink-based real-time computing platform, and the data is then written to the Iceberg table in real time by the copy-on-write sink.


9. Other ingestion practices

Practice 1: reduce empty commits

  • Problem description:

    When upstream Kafka has no data for a long time, each checkpoint still generates a new snapshot, resulting in a large number of empty files and unnecessary snapshots.

  • Solution (PR #2042):

    Add the configuration flink.max-continuous-empty-commits: when checkpoints contain no data, a commit (and hence a snapshot) is triggered only after the configured number of consecutive empty checkpoints.
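The logic behind the configuration, sketched with simplified committer internals:

```java
import java.util.List;
import org.apache.iceberg.DataFile;

class EmptyCommitGate {
    private final int maxContinuousEmptyCommits; // flink.max-continuous-empty-commits
    private int continuousEmpty = 0;

    EmptyCommitGate(int maxContinuousEmptyCommits) {
        this.maxContinuousEmptyCommits = maxContinuousEmptyCommits;
    }

    /** @return true if this checkpoint should actually commit (create a snapshot). */
    boolean shouldCommit(List<DataFile> pendingFiles) {
        if (!pendingFiles.isEmpty()) {
            continuousEmpty = 0;
            return true;
        }
        // skip empty commits until the configured threshold is reached
        return ++continuousEmpty % maxContinuousEmptyCommits == 0;
    }
}
```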

Practice 2: record the watermark

  • Problem description:

    The Iceberg table itself cannot directly reflect how far data writing has progressed, so offline scheduling cannot accurately trigger downstream tasks.

  • Solution (PR #2109):

    In the commit phase, record Flink's watermark in the Iceberg table's properties. This directly reflects end-to-end latency and can be used to judge partition-data completeness when scheduling and triggering downstream tasks.
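A sketch of the idea, with "flink.watermark" as an assumed property name:

```java
import org.apache.iceberg.Table;

class WatermarkRecorder {
    /** Called in the commit phase, after the checkpoint's data files are committed. */
    void recordWatermark(Table table, long watermarkMillis) {
        // persist write progress where schedulers and users can read it;
        // 'flink.watermark' is an illustrative property name
        table.updateProperties()
                .set("flink.watermark", String.valueOf(watermarkMillis))
                .commit();
    }
}
```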

Practice 3: optimize table deletion

  • Problem description:

    Deleting an Iceberg table can be slow, causing the platform API to time out. Because Iceberg's IO layer is abstracted with object storage in mind, there is no fast way to clear a whole directory.

  • Solution:

    Extend FileIO with a deleteDir method to quickly delete table data on HDFS.
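A sketch of the extension, assuming an HDFS-backed table and Iceberg's HadoopFileIO as the base class (class name and error handling are illustrative):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.hadoop.HadoopFileIO;

public class DirectoryAwareFileIO extends HadoopFileIO {
    private final Configuration conf;

    public DirectoryAwareFileIO(Configuration conf) {
        super(conf);
        this.conf = conf;
    }

    /** Recursively delete a table directory in one call instead of file by file. */
    public void deleteDir(String location) {
        try {
            Path path = new Path(location);
            FileSystem fs = path.getFileSystem(conf);
            fs.delete(path, true /* recursive */);
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to delete " + location, e);
        }
    }
}
```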

10. Small-file compaction and data cleanup

A batch job (Spark 3) is executed periodically for each table, in the following three steps:

1. Periodically merge small files in newly written partitions:

rewriteDataFilesAction.execute(); // only merges small files; old files are not deleted

2. Delete expired snapshots and clean up their metadata and data files:

table.expireSnapshots().expireOlderThan(timestamp).commit();

3. Clean up orphan files; by default, clean up unreachable files older than 3 days:

removeOrphanFilesAction.olderThan(timestamp).execute();
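Assembled into a single job, the three steps could look like the sketch below (Iceberg 0.11 Actions API on Spark 3; database and table names are placeholders):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Actions;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;
import org.apache.spark.sql.SparkSession;

public class TableMaintenanceJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-table-maintenance").getOrCreate();
        HiveCatalog catalog = new HiveCatalog(spark.sparkContext().hadoopConfiguration());
        Table table = catalog.loadTable(TableIdentifier.of("db", "table"));

        long threeDaysAgo = System.currentTimeMillis() - 3 * 24 * 60 * 60 * 1000L;

        // 1. merge small files in newly written partitions (old files are kept)
        Actions.forTable(table).rewriteDataFiles().execute();

        // 2. expire old snapshots along with the metadata/data files they own
        table.expireSnapshots().expireOlderThan(threeDaysAgo).commit();

        // 3. remove files that no snapshot references any more
        Actions.forTable(table).removeOrphanFiles().olderThan(threeDaysAgo).execute();
    }
}
```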

11. Compute engine – Flink

Flink is the core compute engine of the real-time platform. At present it mainly supports the data ingestion scenario, with the following characteristics.

  • Near-real-time data ingestion:

    Flink has the deepest integration with Iceberg for data ingestion, and the Flink community actively embraces data lake technology.

  • Platform integration:

    AutoStream introduces IcebergCatalog and supports creating tables and ingesting data through SQL. AutoDTS supports configuring ingestion of MySQL, SQL Server and TiDB tables into the lake.

  • Stream/batch unification:

    Under the stream/batch unification concept, Flink's advantages will gradually come to the fore.

12. Compute engine – Hive

On the SQL batch side, Hive is integrated with Iceberg together with Spark 3, mainly providing the following three functions.

  • Periodic small-file compaction and metadata queries:

    SELECT * FROM prod.db.table.history to view snapshots; files and manifests can be queried the same way.

  • Offline data writing (see the sketch after this list):

    1) INSERT INTO 2) INSERT OVERWRITE 3) MERGE INTO

  • Analytical queries:

    Mainly supports daily near-real-time analysis and query scenarios.
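As referenced in the list above, a sketch of these batch operations issued as Spark SQL (assuming an existing SparkSession named spark and placeholder table names):

```java
// Metadata queries: snapshot history, data files, manifests.
spark.sql("SELECT * FROM prod.db.table.history").show();
spark.sql("SELECT * FROM prod.db.table.files").show();
spark.sql("SELECT * FROM prod.db.table.manifests").show();

// Offline writes.
spark.sql("INSERT INTO prod.db.table SELECT * FROM prod.db.staging");
spark.sql("INSERT OVERWRITE prod.db.table SELECT * FROM prod.db.staging");
spark.sql("MERGE INTO prod.db.table t USING prod.db.updates u ON t.id = u.id "
        + "WHEN MATCHED THEN UPDATE SET * "
        + "WHEN NOT MATCHED THEN INSERT *");
```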

13. Compute engine – Trino / Presto

AutoBI has been integrated with Presto and is used for reporting and analytical query scenarios.

14. Pitfalls

1. Exception when accessing Hive Metastore

Problem description: misuse of HiveConf's constructor caused the configuration declared on the Hive client to be overwritten, leading to an exception when accessing Hive Metastore.

Solution (PR #2075): fix the HiveConf construction by calling the addResource method to ensure the configuration is not overwritten: hiveConf.addResource(conf);

2. Hive Metastore lock not released

Problem description: "CommitFailedException: timed out after 181138 ms waiting for lock xxx." The cause is that HiveMetaStoreClient.lock must also explicitly call unlock when the lock was not acquired; otherwise the above exception occurs.

Solution (PR #2263): optimize the HiveTableOperations#acquireLock method to call unlock and release the lock when lock acquisition fails.
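The gist of the fix, sketched (simplified relative to the actual PR; the client and request variables are assumed): release the Metastore lock in a finally block even when acquisition fails:

```java
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.LockRequest;
import org.apache.hadoop.hive.metastore.api.LockResponse;
import org.apache.hadoop.hive.metastore.api.LockState;
import org.apache.iceberg.exceptions.CommitFailedException;

class LockReleaseSketch {
    void commitWithLock(IMetaStoreClient client, LockRequest request) throws Exception {
        LockResponse lock = null;
        try {
            lock = client.lock(request);
            if (lock.getState() != LockState.ACQUIRED) {
                throw new CommitFailedException("Timed out waiting for lock");
            }
            // ... perform the metastore commit while holding the lock ...
        } finally {
            if (lock != null) {
                // release even when acquisition failed or timed out
                client.unlock(lock.getLockid());
            }
        }
    }
}
```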

3. Missing metadata file

Problem description: the Iceberg table cannot be accessed, reporting "NotFoundException: failed to open input stream for file: xxx.metadata.json".

Solution (PR #2328): when the Hive Metastore call that updates the Iceberg table's metadata_location times out, add a check mechanism to confirm that the metadata was indeed not saved before deleting the metadata file.

3、 Summary and benefits

1. Summary

We summarize our exploration of the lakehouse and of stream/batch unification separately.

  • Lakehouse

    1) Iceberg supports Hive Metastore;

    2) Overall usage is similar to Hive tables: same data format, same compute engines.

  • Stream/batch fusion

    Achieves stream/batch unification in near-real-time scenarios: same source, same computation, same storage.

2. Business benefits

  • Improved data timeliness:

    Warehousing latency dropped from more than 2 hours to less than 10 minutes, and the SLA for core algorithm tasks is now met 2 hours earlier.

  • Near-real-time analysis and queries:

    Combined with Spark 3 and Trino, near-real-time multidimensional analysis and queries are supported.

  • Improved feature-engineering efficiency:

    Near-real-time sample data improves the timeliness of model training.

  • Near-real-time warehousing of CDC data:

    Business tables in the data warehouse can be analyzed and queried in near real time.

3. Architecture benefits – near-real-time data warehouse

[Figure: near-real-time data warehouse architecture]

As described above, we support near-real-time warehousing and analysis, which validates the basic architecture for subsequent near-real-time warehouse construction. The advantages of a near-real-time warehouse are: develop once, unified metrics, unified storage, a true batch/stream unification. The disadvantage is weaker real-time performance: latency that used to be seconds or milliseconds becomes minute-level data visibility.

At the architecture level, however, this is highly significant. Going forward, we hope to turn the entire original "T+1" data warehouse into a near-real-time one, improving the warehouse's overall data timeliness and better supporting upstream and downstream businesses.

4、 Follow-up planning

1. Follow the Iceberg version

Fully adopt the v2 format and support MOR (merge-on-read) ingestion of CDC data.

2. Near-real-time warehouse construction

Based on Flink, comprehensively speed up each layer of the log warehouse through the data pipeline mode.

3. Stream/batch unification

As upsert functionality gradually matures, continue exploring stream/batch unification at the storage level.

4. Multidimensional analysis

Deliver near-real-time multidimensional analysis based on Presto / Spark 3.
