Data stack technology sharing: open source · data stack – extend flinksql to join streams and dimension tables

Time:2021-7-9

1. Extend flinksql to join streams and dimension tables

Data stack technology sharing: open source · data stack - extend flinksql to join streams and dimension tables

2、 Why extend flinksql?

1. Real time computing needs to be fully SQL based

SQL is the most widely used language in data processing. It allows users to state their business logic succinctly. It is very common to use SQL in big data batch computing, but there are not many real-time computing supporting SQL. In fact, using SQL to develop real-time tasks can greatly reduce the threshold of data development. In kangaroo cloud stack real-time computing module, we decided to realize complete SQL.

Advantages of using SQL in data calculation

☑ Declarative. Users only need to express what I want, as for how to calculate it is a matter of the system, users do not care.

☑ Automatic tuning. The query optimizer can generate the most available execution plan for the user’s SQL. Users don’t need to know about it to automatically enjoy the performance improvement brought by the optimizer.

☑ Easy to understand. Many people in different industries and fields understand SQL. The learning threshold of SQL is very low. Using SQL as a cross team development language can greatly improve efficiency.

☑ stable. SQL is a language with decades of history, is a very stable language, few changes. Therefore, when we upgrade the engine version, or even replace it with another engine, we can upgrade it compatibly and smoothly.

Reference link:https://blog.csdn.net/weixin_…

2. Real time computation also needs the join of stream and dimension table

Data stack technology sharing: open source · data stack - extend flinksql to join streams and dimension tables

In the world of real-time computing, not only the join between streams, but also the join between streams and dimension tables. Last year, during the development of V3.0 version of kangaroo cloud stack, flinksql, the latest version at that time, has applied the advantages of SQL to the Flink engine, but it does not support the join of flow and dimension table.

Flinksql started to serve Alibaba Group’s open stream computing service in July 2017. Although it is a very young product, it has supported thousands of jobs during the double 11 period. During the double 11 period, the peak value of blink jobs reached 500 + million per second, and the total peak value of flinksql jobs alone reached 300 million per second.

Reference link:https://yq.aliyun.com/article…

Let’s first explain what dimension table is; Dimension table is a dynamic table. The data stored in the table may not change or be updated regularly, but the update frequency is not very frequent. In business development, the general dimension table data is stored in relational databases such as mysql, Oracle, etc., or in NoSQL databases such as HBase, redis, etc.

3、 Join flow and dimension table step by step with flinksql

1. Using Flink API to realize the function of dimension table

To realize the dimension table function, we need to use the function of Flink aysnc I / O, which was contributed by Alibaba to Apache Flink.

Data stack technology sharing: open source · data stack - extend flinksql to join streams and dimension tables

Async I / O is contributed to the community by Alibaba. It was introduced in version 1.2. Its main purpose is to solve the problem that network delay becomes the bottleneck of the system when interacting with external systems.

See this article for detailshttp://wuchong.me/blog/2017/0…

The API corresponding to Flink is the abstract class richasyncfunction, which implements the open (initialization), asyncinvoke (asynchronous data call), close (stop operation) methods, and the most important is to implement the methods in asyncinvoke.

There are two problems in the join of stream and dimension table

1) The first is performance.

Because if the flow rate is very fast, every piece of data needs to join in the dimension table, but the data in the dimension table exists in a third-party storage system. If you access the third-party storage system in real time, not only will the performance of the join be poor, but you have to go through the network every time; It will also bring a lot of pressure to the third-party storage system, which may hang up.

Therefore, the solution is to cache the data in the dimension table, which can be fully cached. This is mainly due to the situation that the dimension table data is not large, and another is the situation that the LRU cache has a large amount of dimension table data.

Data stack technology sharing: open source · data stack - extend flinksql to join streams and dimension tables

2) The second problem is that the data delayed by the stream is associated with the previous dimension table data.

This involves storing snapshot data for dimension table data, so it is more suitable to use HBase as dimension table in such a scenario, because HBase naturally supports multiple versions of data.

Data stack technology sharing: open source · data stack - extend flinksql to join streams and dimension tables

2. The SQL syntax of parsing stream and dimension table join is transformed into the underlying Flink API

Because flinksql has done most of the SQL scenarios, we can’t parse all the syntax of SQL and transform it into the underlying flinksql API.

So what we do is to parse the SQL syntax to find out whether there is a dimension table in the join table. If there is a dimension table, we will separate the statement of the dimension table of the join, generate a new datastream with Flink’s tableapi and streamapi, and join the datastream with other tables. In this way, we can use SQL to realize the join syntax of the stream and dimension table.

The tool for SQL parsing is Apache compute, and Flink also uses this framework for SQL parsing. So all grammars can be parsed.

1)DEMO SQL

insert
into

  MyResult 
  select 
        d.channel, 
        d.info 
  from 
         (       select a.*,b.info 
          from 
                   MyTable a 
          join sideTable b 
                   on a.channel=b.name     
          where a.channel = 'xc2’ 
                      and a.pv=10     ) as d 

2) Calculate parses the insert into statement and splits it into sub statements

select a.*,b.info from MyTable a join sideTable b on a.channel=b.name

  where a.channel = 'xc2' and a.pv=10

select d.channel, d.info from d

insert into MyResult

3) Calculate continues to parse the select statement

old: select a.*,b.info from MyTable a join sideTable b on a.channel=b.name
where a.channel = ‘xc2’ and a.pv=10

Data stack technology sharing: open source · data stack - extend flinksql to join streams and dimension tables
Data stack technology sharing: open source · data stack - extend flinksql to join streams and dimension tables
Data stack is a cloud native station data platform PAAS. We have an interesting open source project on GitHub and gitee: flinkx. Flinkx is a unified data synchronization tool based on Flink, which can collect both static data and real-time data. It is a global, heterogeneous, batch flow integrated data synchronization engine. If you like, please give us a star! star! star!

GitHub open source project:https://github.com/DTStack/fl…

Gitee open source project:https://gitee.com/dtstack_dev…