Flink + Iceberg: Tencent's 10-billion-scale real-time data entering the lake

Time: 2022-1-5

Introduction: Content shared at the Flink Meetup in Shanghai, a case study of how Tencent Data Lake put 10-billion-scale data scenarios into production.

This article summarizes the talk on bringing 10 billion real-time records into the lake, given by Chen Junjie, senior engineer in Tencent Data Lake R&D, at the Flink Meetup in Shanghai on April 17. The article covers:

  1. Introduction to Tencent Data Lake
  2. Landing scenarios at the 10-billion scale
  3. Future planning
  4. Summary

GitHub address:
https://github.com/apache/flink
You are welcome to star the Flink project.

1、 Introduction to Tencent Data Lake

[Figure: overview of the Tencent Data Lake platform]

As the figure above shows, the whole platform is fairly large. It covers data access, upper-level analysis, intermediate management (such as task management, analysis management, and engine management), and the table format at the lowest level.

2、 Landing scenarios at the 10-billion scale

1. Traditional platform architecture

[Figure: Lambda architecture and Kappa architecture]

As shown in the figure above, there used to be only two traditional platform architectures: the Lambda architecture and the Kappa architecture.

  • In the Lambda architecture, batch and stream processing are separated, so two sets of clusters must be operated and maintained, one for Spark / Hive and the other for Flink. This causes several problems:

    • First, the cost of operation and maintenance is relatively high;
    • Second, the development cost. On the business side, both Spark and Flink jobs, or SQL, have to be written. Overall, this development cost is not friendly to data analysts.
  • The second is the Kappa architecture. In essence, a message queue carries the data down to the underlying layer, and the analysis is done afterwards. It is relatively fast and, being based on Kafka, offers a degree of real-time capability.

Both architectures have their own advantages and disadvantages. The biggest problem is that storage may be inconsistent, which fragments the data links. Our platform is now connected to Iceberg. The following sections describe, scenario by scenario, the problems we ran into and how we solved them.

2. Scenario 1: Mobile QQ security data entering the lake


Bringing Mobile QQ security data into the lake is a very typical scenario.

In the current business scenario, Flink lands data from the message queue TubeMQ into the ODS layer in Iceberg. Flink is then used to join it with some user tables to build a wide table, which serves queries and is placed in COS, where it may be analyzed in BI scenarios.

This process looks ordinary, but the user dimension table joined for Mobile QQ holds 2.8 billion records and the message queue carries 10 billion messages per day, so it faces some real challenges.

  • Small file challenges

    1. The Flink writer generates small files

      Flink writes without a shuffle, so the distributed data arrives out of order, which produces many small files.

    2. Low latency requirements

      The checkpoint interval is short and the commit interval is small, which amplifies the small-file problem.

    3. Small file explosion

      Within a few days, the small files of both metadata and data exploded at the same time, putting the cluster under great pressure.

    4. Merging small files amplifies the problem

      To solve the small-file problem, we started an Action to merge small files, which resulted in even more files.

    5. Data cannot be deleted fast enough

      Snapshots and orphan files are deleted, but too many files have to be scanned, which puts the NameNode under great pressure.

  • Solution

    1. Flink synchronous merge

      • Add small-file merge operators;
      • Add an automatic snapshot cleaning mechanism.

        1) snapshot.retain-last.nums

        2) snapshot.retain-last.minutes

    2. Spark asynchronous merge

      • Add background services to merge small files and delete orphan files;
      • Add small-file filtering logic and delete small files step by step;
      • Add logic for merging partition by partition, to avoid generating too many deleted files at one time and causing task OOM.
  • Flink synchronous merge


After all data files are committed, a commit result is generated. We use the commit result to generate a compaction task, hand it to multiple task managers to do the rewrite, and finally commit the result to the Iceberg table.

The key, of course, is how compacttaskgenerator works. At first we wanted to merge as much as possible, so we scanned the whole table, which meant scanning a great many files. However, the table is very large and contains many small files, so a single scan immediately brought the whole Flink job down.

Our fix was to scan the data incrementally after each merge: take the increment from the previous replace operation up to the present, see how much has been added, and pick out the files that match the rewrite strategy.

There are many configuration options here, such as how many snapshots must accumulate or how many files must be available before a merge is triggered. Users can set these themselves. We also provide default values, so that users can benefit from these functions without having to be aware of them.
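The incremental planning described above can be illustrated with a minimal sketch. It is not the actual compacttaskgenerator: the `DataFile` record, the `sequence` field (a monotonically increasing commit counter assumed here for simplicity), and the `small_file_bytes` and `min_files` thresholds are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int
    sequence: int  # hypothetical: commit counter of the snapshot that added the file


def plan_incremental_compaction(
    files: List[DataFile],
    last_replace_sequence: int,
    small_file_bytes: int,
    min_files: int,
) -> List[DataFile]:
    """Pick rewrite candidates only from files added since the last replace.

    Files committed at or before `last_replace_sequence` are skipped, so the
    planner never rescans the whole table.
    """
    candidates = [
        f
        for f in files
        if f.sequence > last_replace_sequence and f.size_bytes < small_file_bytes
    ]
    # Assumed rewrite strategy: only merge once enough small files accumulated.
    return candidates if len(candidates) >= min_files else []
```

The point of the design is that the cost of planning grows with the amount of newly written data rather than with the size of the table.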

  • Pitfalls of the fanout writer


With the fanout writer, a large data volume can mean a large number of partitions. For example, Mobile QQ data is partitioned by province and city, but the partitions were still very large after that split, so they were further divided into buckets. At that point each task manager may receive data for many partitions. If one writer is opened for each partition, there will be many writers, and memory runs out.

Here we did two things:

  • The first is keyby support. We do a keyby on the partition set by the user, so that data of the same partition is gathered in one task manager, which then does not need to open so many partition writers. Of course, this approach brings some performance loss.
  • The second is an LRU writer, which maintains a map in memory.
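As a rough illustration of the LRU writer idea above, the sketch below keeps at most a fixed number of partition writers open and closes the least recently used one when the limit is reached. The class names, the `max_open` parameter, and the in-memory `ListWriter` stand-in are assumptions made for this example; they are not the writer classes used in the talk or in Iceberg.

```python
from collections import OrderedDict
from typing import Callable, List


class ListWriter:
    """Stand-in for a partition file writer; it only collects records in memory."""

    def __init__(self, partition: str) -> None:
        self.partition = partition
        self.records: List[object] = []
        self.closed = False

    def append(self, record: object) -> None:
        self.records.append(record)

    def close(self) -> None:
        self.closed = True


class LruPartitionWriters:
    """Keeps at most `max_open` writers open, evicting the least recently used."""

    def __init__(self, max_open: int, open_writer: Callable[[str], ListWriter]) -> None:
        self._max_open = max_open
        self._open_writer = open_writer
        self._writers: "OrderedDict[str, ListWriter]" = OrderedDict()

    def write(self, partition: str, record: object) -> None:
        writer = self._writers.get(partition)
        if writer is None:
            if len(self._writers) >= self._max_open:
                # Close the writer that has gone unused the longest.
                _, evicted = self._writers.popitem(last=False)
                evicted.close()
            writer = self._open_writer(partition)
            self._writers[partition] = writer
        else:
            # Mark this partition as most recently used.
            self._writers.move_to_end(partition)
        writer.append(record)

    def open_partitions(self) -> List[str]:
        return list(self._writers)
```

The trade-off in a sketch like this is that a partition evicted and then written again gets a new writer, and therefore a new file, so a very small `max_open` bounds memory at the cost of more small files.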

3. Scenario 2: news platform index analysis

[Figure: online index architecture for news articles]

Above is the online index architecture for news articles, built on Iceberg with unified stream and batch processing. On the left are the dimension tables on HDFS collected by Spark, and on the right is the access system. After collection, Flink performs a window-based join with the dimension tables and writes the result to the index pipeline table.

  • Functions

    • Quasi-real-time detail layer;
    • Real-time streaming consumption;
    • Streaming merge into;
    • Multidimensional analysis;
    • Offline analysis.
  • Scenario characteristics

    The above scenario has the following characteristics:

    • Order of magnitude: a single index table exceeds 100 billion, a single batch is 20 million, and the daily average is 100 billion;
    • Latency requirement: end-to-end data visibility at the minute level;
    • Data sources: full volume, quasi-real-time increments, and message streams;
    • Consumption modes: streaming consumption, batch loading, point queries, row updates, and multidimensional analysis.
  • Challenge: merge into

    Some users asked for merge into, so we considered it from three aspects:

    • Function: merge the stream table produced by each batch join into the real-time index table for downstream use;
    • Performance: downstream requires highly timely indexes, so merge into must be able to keep up with the upstream batch consumption window;
    • Ease of use: Table API? Action API? Or SQL API?
  • Solution

    1. Step 1

      • Design joinrowprocessor with reference to Delta Lake;
      • Use Iceberg's WAP mechanism to write temporary snapshots.
    2. Step 2

      • Optionally skip the cardinality check;
      • When writing, optionally hash only, without sorting.
    3. Step 3

      • Support the DataFrame API;
      • Spark 2.4 supports SQL;
      • Spark 3.0 uses the community version.
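To make the cardinality check in Step 2 concrete, here is a minimal in-memory sketch of merge into semantics. It is an illustration under simplifying assumptions, not the joinrowprocessor from the talk: rows are plain dictionaries, the `merge_into` function and its `check_cardinality` flag are hypothetical names, and the check is stricter than a real engine's because it rejects any duplicate source key rather than only duplicates that match a target row.

```python
from typing import Dict, Hashable, List


def merge_into(
    target: Dict[Hashable, dict],
    source: List[dict],
    key: str,
    check_cardinality: bool = True,
) -> Dict[Hashable, dict]:
    """Upsert `source` rows into `target`, which is keyed by the `key` column.

    With check_cardinality=True, duplicate keys in the source are rejected,
    because it would be ambiguous which source row updates the target row.
    Skipping the check saves a pass over the source; the caller then has to
    guarantee that the source keys are unique.
    """
    if check_cardinality:
        seen = set()
        for row in source:
            k = row[key]
            if k in seen:
                raise ValueError(f"multiple source rows match key {k!r}")
            seen.add(k)

    merged = dict(target)  # leave the input untouched
    for row in source:
        merged[row[key]] = row  # matched: update; not matched: insert
    return merged
```

Skipping the check is only safe when uniqueness is already guaranteed upstream, for example when each batch has been deduplicated by key before the merge.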

4. Scenario 3: advertising data analysis

  • Advertising data mainly has the following characteristics:

    • Order of magnitude: 100 billion Pb data per day, 2K per record;
    • Data source: Spark Streaming increments entering the lake;
    • Data characteristics: tags keep increasing and the schema keeps changing;
    • Usage: interactive query analysis.
  • Challenges encountered and corresponding solutions:

    • Challenge 1: The schema is deeply nested. After flattening there are nearly 10000 columns, and writes run out of memory (OOM).

      Solution: By default, each Parquet page size is set to 1M; it needs to be set according to the executor memory.

    • Challenge 2: 30 days of data basically overwhelms the cluster.

      Solution: Provide an Action for life cycle management, and distinguish between the document life cycle and the data life cycle.

    • Challenge 3: Interactive query.

      Solution:

      1) column projection;
      2) predicate push down.
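For challenge 1 above, a back-of-the-envelope estimate shows why nearly 10000 columns with a 1M page size can exhaust executor memory. The sketch assumes, as a simplification, that each column buffers roughly one page per open writer; real Parquet writer memory also depends on row group size, dictionaries, and compression, so treat this as a rough lower bound. Both function names are hypothetical.

```python
def estimate_page_buffer_bytes(
    num_columns: int, page_size_bytes: int, open_writers: int = 1
) -> int:
    """Rough lower bound: one buffered page per column per open writer."""
    return num_columns * page_size_bytes * open_writers


def page_size_for_budget(
    memory_budget_bytes: int, num_columns: int, open_writers: int = 1
) -> int:
    """Largest page size whose estimated page buffers fit within the budget."""
    return memory_budget_bytes // (num_columns * open_writers)


MIB = 1024 * 1024
GIB = 1024 * MIB

# 10000 columns with 1 MiB pages: about 9.8 GiB of page buffers for one writer.
wide_table_bytes = estimate_page_buffer_bytes(10_000, MIB)

# Page size that keeps the same table within a 2 GiB budget: about 210 KiB.
tuned_page_size = page_size_for_budget(2 * GIB, 10_000)
```

Under these assumptions the page size has to shrink roughly in proportion to the column count, which matches the advice above to size it according to executor memory.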

3、 Future planning

Future planning is mainly divided into the kernel side and the platform side.

1. Kernel side

On the kernel side, we have the following plans:

  • More data access

    • Incremental lake entry support;
    • V2 format support;
    • Row identity support.
  • Faster queries

    • Index support;
    • Alluxio acceleration layer support;
    • MOR optimization.
  • Better data governance

    • Data governance Action;
    • SQL extension support;
    • Better metadata management.

2. Platform side

On the platform side, we have the following plans:

  • Data governance service

    • Metadata cleaning service;
    • Data governance service.
  • Incremental lake entry support

    • Spark consuming CDC into the lake;
    • Flink consuming CDC into the lake.
  • Metric monitoring and alerting

    • Write metrics;
    • Small file monitoring and alerting.

4、 Summary

From applying and practicing this in large-scale production, we draw three conclusions:

  • Usability: Real-world use across multiple business lines confirms that Iceberg can withstand the test of 10 billion or even 100 billion records per day.
  • Ease of use: The barrier to entry is relatively high, and more work is needed to make it easy for users.
  • Scenario support: At present, fewer lake entry scenarios are supported than with Hudi, and incremental reading is also missing, so more effort is needed to fill these gaps.

