Data stack technology sharing: open source · data stack – extend flinksql to realize the join of flow and dimension table

Time:2022-4-2

I. extend flinksql to realize the join of flow and dimension table

Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension table

2、 Why extend flinksql?

1. Real time computing requires full SQL

SQL is the most widely used language in data processing. It allows users to state their business logic concisely. It is not common to use SQL to calculate large data in real time. In fact, using SQL to develop real-time tasks can greatly reduce the threshold of data development. In the kangaroo cloud stack real-time computing module, we decided to realize full SQL.

Advantages of using SQL for data calculation

☑ Declarative. Users only need to express what I want. As for how to calculate, it’s a matter of the system, and users don’t need to care.

☑ Automatic tuning. The query optimizer can generate the most execution plan for the user’s SQL. Users do not need to know it, they can automatically enjoy the performance improvement brought by the optimizer.

☑ Easy to understand. Many people in different industries and fields understand SQL. The learning threshold of SQL is very low. Using SQL as a cross team development language can greatly improve efficiency.

☑ stable. SQL is a language with decades of history. It is a very stable language with few changes. Therefore, when we upgrade the version of the engine, or even replace it with another engine, we can upgrade it compatibly and smoothly.

Reference link:https://blog.csdn.net/weixin_…

2. Real time calculation also requires the join of flow and dimension table

Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension table

In the world of real-time computing, not only the join of flow and flow, but also the join of flow and dimension table. Last year, kangaroo cloud stack v3 During the development of version 0, the latest version at that time – flink1 Flinksql in 6 has applied the advantages of SQL to Flink engine, but it does not support the join of flow and dimension table.

Flinksql started to open streaming computing services to Alibaba Group in July 2017. Although it is a very young product, it has supported thousands of jobs during the double 11 period. During the double 11 period, the peak processing of blink jobs reached 500 + 100 million per second, of which the total peak processing of flinksql jobs alone reached 300 million per second.

Reference link:https://yq.aliyun.com/article…

First, explain what a dimension table is; Dimension table is a dynamic table. The data stored in the table may remain unchanged or may be updated regularly, but the update frequency is not very frequent. In business development, the general dimension table data is stored in relational databases, such as MySQL and Oracle, or in NoSQL databases such as HBase and redis.

3、 Flinksql implements the join of flow and dimension table step by step

1. Using Flink API to realize the function of dimension table

To realize the dimension table function, we need to use the function of Flink aysnc I / O, which is contributed by Alibaba to Apache Flink.

Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension table

Async I / O is contributed to the community by Alibaba and was introduced in version 1.2. The main purpose is to solve the problem that network delay has become a system bottleneck when interacting with external systems.

See this article for details:http://wuchong.me/blog/2017/0…

The API corresponding to Flink is the abstract class richasyncfunction. The following abstract class implements the open (initialization), asyncinvoke (data asynchronous call) and close (some stopped operations) methods. The most important thing is to implement the methods in asyncinvoke.

The join of flow and dimension table will encounter two problems:

1) The first is performance.

Because if the flow rate is fast, every piece of data needs to be joined to the dimension table, but the data of the dimension table exists in the third-party storage system. If you access the third-party storage system in real time, not only the performance of the join will be poor, but also the network IO will be used every time; It will also bring great pressure to the third-party storage system, which may hang up.

Therefore, the solution is to cache the data in the dimension table, which can be fully cached. This is mainly when the dimension table data is small, and another is LRU cache, which has a large amount of dimension table data.

Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension table

2) The second problem is that the data delayed by the stream is associated with the previous dimension table data.

This involves that dimension table data needs to store snapshot data, so it is more suitable to use HBase as dimension table in such a scenario, because HBase naturally supports multiple versions of data.

Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension table

2. The SQL syntax of parsing stream and dimension table join is transformed into the underlying Flink API

Because flinksql has done most SQL scenarios, it is impossible for us to parse all the syntax of SQL and transform it into the underlying flinksql API.

So what we do is to parse the SQL syntax to find out whether there is a dimension table in the join table. If there is a dimension table, we will separate the statement of the dimension table of the join, generate a new datastream with Flink’s tableapi and streamapi, and join the datastream with other tables, so that we can use SQL to realize the join syntax of flow and dimension table.

Apache invoke is the tool for SQL parsing, and Flink also uses this framework for SQL parsing. So all syntax can be parsed.

1)DEMO SQL

insert
into

  MyResult 
  select 
        d.channel, 
        d.info 
  from 
         (       select a.*,b.info 
          from 
                   MyTable a 
          join sideTable b 
                   on a.channel=b.name     
          where a.channel = 'xc2’ 
                      and a.pv=10     ) as d 

2) Calculate parses the insert into statement and splits the sub statement

select a.*,b.info from MyTable a join sideTable b on a.channel=b.name

  where a.channel = 'xc2' and a.pv=10

select d.channel, d.info from d

insert into MyResult

3) Calculate continues to parse the select statement

old: select a.*,b.info from MyTable a join sideTable b on a.channel=b.name
where a.channel = ‘xc2’ and a.pv=10

Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension table
Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension table
The data stack is a cloud native one-stop data center PAAS. We have an interesting open source project on GitHub and gitee: flinkx. Flinkx is a batch stream unified data synchronization tool based on Flink. It can collect both static data and real-time changing data. It is a global, heterogeneous and batch stream integrated data synchronization engine. Please order us a star if you like! star! star!

GitHub open source project:https://github.com/DTStack/fl…

Gitee open source project:https://gitee.com/dtstack_dev…