Fligee is an open-source, heterogeneous data collection engine based on Flink that supports collecting and synchronizing data in both batch and streaming modes. Please give us a star if you like the project!
GitHub open source project: https://github.com/DTStack/fl…
Gitee open source project: https://gitee.com/dtstack_dev…
First of all, this article is based on Flink 1.5.4.
1、 Why do we extend Flink SQL?
Flink itself does not provide SQL syntax for declaring input sources and output destinations. During data development, sources and sinks have to be written against the APIs Flink provides, which is extremely cumbersome: you must understand not only the APIs of Flink's various operators, but also the client libraries of the components involved (such as Kafka, Redis, MongoDB, and HBase), and there is no SQL-level way to join against external data sources. As a result, using Flink SQL directly for real-time data analysis imposes a large amount of extra work on data developers.
Our goal is that when using Flink SQL, users only need to care about what to do, not how to do it: they should not have to worry about implementation details and can focus on business logic.
Next, let’s take a look at the extended implementation of Flink SQL!
2、 Which Flink SQL statements we extended
1. Create source table statement
2. Create output table statement
3. Create custom function
4. Dimension table joins
3、 How each module is translated into Flink's implementation
1. How to convert the SQL statement that creates the source table into the Flink operator
In Flink, every table is mapped to a Table object, which is then registered into the table environment by calling the registration method.
Currently, we only support Kafka data sources. Flink itself already provides an implementation class for reading from Kafka, FlinkKafkaConsumer09, so we only need to instantiate it with the specified parameters and call the registration method to register the table.
In addition, note that rowtime and proctime are frequently used in Flink SQL, so we add extra rowtime and proctime columns to the registered table schema.
When rowtime is needed, you must assign timestamps and watermarks on the DataStream (assignTimestampsAndWatermarks). A custom watermark assigner mainly does two things: 1) extract the event-time field from each Row; 2) set the maximum allowed delay.
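The two responsibilities above can be sketched without any Flink dependency. The class below mimics what a periodic watermark assigner does: pull the event-time field out of each row and emit a watermark that lags the largest timestamp seen by a fixed delay. The field index and delay values are illustrative assumptions, not the project's actual configuration.

```java
// Simplified sketch (no Flink dependency) of a custom watermark assigner:
// extract the event-time field from a row and compute the current watermark
// as the largest observed timestamp minus the maximum tolerated delay.
public class EventTimeExtractor {
    private final int timeFieldIndex; // which column of the row holds event time
    private final long maxDelayMs;    // maximum tolerated out-of-orderness
    private long maxTimestamp = Long.MIN_VALUE;

    public EventTimeExtractor(int timeFieldIndex, long maxDelayMs) {
        this.timeFieldIndex = timeFieldIndex;
        this.maxDelayMs = maxDelayMs;
    }

    // Mirrors extractTimestamp(row): read the time field and remember the max.
    public long extractTimestamp(Object[] row) {
        long ts = (Long) row[timeFieldIndex];
        maxTimestamp = Math.max(maxTimestamp, ts);
        return ts;
    }

    // Mirrors getCurrentWatermark(): largest timestamp minus allowed delay.
    public long currentWatermark() {
        return maxTimestamp - maxDelayMs;
    }
}
```

In Flink itself the same logic would live in an AssignerWithPeriodicWatermarks implementation passed to assignTimestampsAndWatermarks.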
2. How to convert the SQL statement that creates the output table into Flink operators
The base class of Flink output operators is OutputFormat. Here we extend RichOutputFormat, an abstract class that implements OutputFormat and additionally provides getRuntimeContext() to access the runtime context, which is convenient for adding custom metrics and similar features later.
Let's take the MySQL sink plugin (writing output to MySQL) as an example. It has two parts:
Parse the CREATE TABLE statement into the table name, field information, and MySQL connection information.
This part uses regular expressions to convert the CREATE TABLE statement into an internal implementation class that stores the table name, field information, plugin type, and plugin connection information.
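As a rough illustration of that parsing step, the sketch below uses one regular expression to split a CREATE TABLE statement into table name, field list, and WITH-style properties. The grammar here is a simplified assumption, not the project's actual one.

```java
import java.util.*;
import java.util.regex.*;

// Illustrative regex-based parse of CREATE TABLE ... WITH (...) into
// table name, field definitions, and connection properties.
public class CreateTableParser {
    private static final Pattern P = Pattern.compile(
        "(?i)CREATE\\s+TABLE\\s+(\\w+)\\s*\\((.*)\\)\\s*WITH\\s*\\((.*)\\)",
        Pattern.DOTALL);

    public static Map<String, Object> parse(String sql) {
        Matcher m = P.matcher(sql.trim());
        if (!m.find()) {
            throw new IllegalArgumentException("not a CREATE TABLE: " + sql);
        }
        Map<String, Object> result = new HashMap<>();
        result.put("tableName", m.group(1));
        // Field definitions, e.g. "id INT", "name VARCHAR".
        result.put("fields", Arrays.asList(m.group(2).split("\\s*,\\s*")));
        // WITH properties, e.g. type='mysql' -> {type: mysql}.
        Map<String, String> props = new HashMap<>();
        for (String kv : m.group(3).split("\\s*,\\s*")) {
            String[] parts = kv.split("\\s*=\\s*", 2);
            props.put(parts[0].replace("'", "").trim(),
                      parts[1].replace("'", "").trim());
        }
        result.put("props", props);
        return result;
    }
}
```

A production parser would need to handle nested parentheses, quoted commas, and computed columns, which is exactly why regex parsing stops scaling for full SQL (see the dimension-table section below).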
Extend RichOutputFormat to write data to the corresponding external data source.
The key method to implement is writeRecord; in the MySQL plugin it uses JDBC to perform the insert or update.
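To make the insert-or-update behavior concrete, the sketch below builds the kind of statement a writeRecord-style MySQL sink could execute through JDBC: MySQL's INSERT ... ON DUPLICATE KEY UPDATE. The table and field names are placeholders; the actual plugin's SQL generation may differ.

```java
import java.util.List;
import java.util.stream.Collectors;

// Builds a single MySQL upsert statement from a table name and field list.
// A sink would prepare this once in open() and bind row values per record.
public class UpsertSqlBuilder {
    public static String build(String table, List<String> fields) {
        String cols = String.join(", ", fields);
        String placeholders = fields.stream()
                .map(f -> "?")
                .collect(Collectors.joining(", "));
        // VALUES(col) refers to the value that would have been inserted.
        String updates = fields.stream()
                .map(f -> f + " = VALUES(" + f + ")")
                .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + placeholders + ")"
                + " ON DUPLICATE KEY UPDATE " + updates;
    }
}
```

writeRecord would then bind each Row's values to the prepared statement's placeholders and execute it (ideally in batches).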
3. How to convert user-defined function statements into Flink operators
Flink provides two base classes for implementing UDFs:
1) Extend ScalarFunction
2) Extend TableFunction
What we need to do is add the jar provided by the user to a URLClassLoader, load the specified class (the class path implementing one of the interfaces above), and then call tableEnvironment.registerFunction(funcName, udfFunc); this completes the UDF registration, after which the user-defined function can be used in SQL.
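The loading step can be sketched with plain JDK classes: add the user's jar URLs to a URLClassLoader, load the named class reflectively, and instantiate it before handing it to registerFunction. The jar path and class name are user-supplied placeholders here.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Sketch of dynamically loading a user-provided UDF class: a URLClassLoader
// over the user's jar(s) resolves the class, and reflection instantiates it.
public class UdfLoader {
    public static Object loadUdf(URL[] jarUrls, String className) {
        try {
            // Delegate to the current classloader as parent, so classes not
            // found in the user's jars are still resolved normally.
            URLClassLoader loader =
                    new URLClassLoader(jarUrls, UdfLoader.class.getClassLoader());
            Class<?> udfClass = loader.loadClass(className);
            return udfClass.getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new RuntimeException("failed to load UDF class " + className, e);
        }
    }
}
```

The returned instance would then be cast to ScalarFunction or TableFunction and passed to tableEnvironment.registerFunction(funcName, udfFunc).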
4. How is the dimension table function implemented?
A common requirement in stream computing is enriching the data stream with additional fields. Because the data produced at the collection side is often limited, the required dimension information must be filled in before analysis, but the current version of Flink does not provide SQL support for joining external data sources.
Several issues need attention when implementing this feature:
1) Dimension table data changes constantly
The implementation must support periodically refreshing the in-memory cache from the external data source, for example by using an LRU eviction strategy.
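A minimal LRU cache for dimension-table rows can be built on LinkedHashMap's access-order mode, as sketched below. This is a simplified assumption about the caching layer; a production version would also expire entries by time so that updated dimension data is re-fetched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: LinkedHashMap in access order evicts the
// least-recently-used entry once capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true -> LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict beyond capacity
    }
}
```

On a cache miss, the lookup would go to the external dimension store and the fetched row would be put into the cache.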
2) I/O throughput
If every incoming record triggered a serial request to the external data source to fetch the associated row, network latency would become the system's biggest bottleneck. Here we choose RichAsyncFunction, the async I/O operator Alibaba contributed to the Flink community. It fetches data from external sources asynchronously, greatly reducing the time spent waiting on network requests.
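The benefit of the asynchronous approach can be illustrated without Flink using CompletableFuture: fire all lookups concurrently and join the results, rather than waiting for each round trip in turn. Here lookup() is a stand-in for the remote call to the dimension store.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Sketch of asynchronous dimension lookups: all requests are issued
// before any result is awaited, so total latency is roughly one round
// trip instead of one round trip per record.
public class AsyncJoinSketch {
    // Placeholder for a network call to the external dimension store.
    static String lookup(String key) {
        return key + "-dim";
    }

    public static List<String> enrich(List<String> keys) {
        List<CompletableFuture<String>> futures = keys.stream()
                .map(k -> CompletableFuture.supplyAsync(() -> lookup(k)))
                .collect(Collectors.toList());
        // Join only after every request is in flight; order is preserved.
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}
```

Inside Flink, RichAsyncFunction plays the same role: asyncInvoke issues the request and completes a result future when the response arrives.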
3) How to parse the dimension tables contained in SQL and map them to Flink operators
Using regular expressions to extract the specified dimension tables and filter conditions from SQL is clearly not a workable approach; trying to match every possible pattern would be an endless process. Looking at how Flink itself handles SQL, it uses Apache Calcite for parsing: the SQL is parsed into a syntax tree, the tree is traversed to find the dimension tables, and the dimension-table and non-dimension-table parts are then separated.
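The traversal step can be sketched over a toy tree structure: each node is either a table reference (leaf) or an operator with children, such as a JOIN, and the walk collects every table name that is registered as a dimension table. Real code would walk Calcite's SqlNode tree instead of this simplified structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy syntax tree walk: collect the referenced tables that belong to the
// registered dimension-table set, so they can be split out and rewritten.
public class SideTableFinder {
    static class Node {
        final String tableName;    // non-null only for leaf table references
        final List<Node> children; // operator nodes (e.g. JOIN) have children
        Node(String tableName, List<Node> children) {
            this.tableName = tableName;
            this.children = children;
        }
    }

    public static List<String> findSideTables(Node root, Set<String> sideTables) {
        List<String> found = new ArrayList<>();
        if (root.tableName != null && sideTables.contains(root.tableName)) {
            found.add(root.tableName);
        }
        for (Node child : root.children) {
            found.addAll(findSideTables(child, sideTables));
        }
        return found;
    }
}
```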
With the steps above, you can use SQL alone to implement the common pipeline of reading from a Kafka source table, joining external dimension data, and writing the result to a specified external destination.