Introduction: This article is shared by Alibaba technical expert Zhou Kaibo (baoniu). It introduces how to get your first Flink SQL job running.
1、 Basic concepts of SQL
1. SQL classification
SQL is divided into four categories: Data Query Language (DQL), Data Manipulation Language (DML), Data Definition Language (DDL), and Data Control Language (DCL). Today we will introduce the use of the first three.
Next, several basic concepts are introduced.
2. SQL development
● Scripts, i.e. the SQL text. Statements in the first three languages above are written as SQL text;
● Schema, i.e. the metadata. The tables and functions used in SQL are defined through schemas;
● Artifacts, i.e. the UDF JAR packages.
In Flink SQL, a catalog manages the metadata. A catalog locates a table through catalog.db.table. Besides databases and tables, a catalog can also register functions such as UDFs, UDTFs, and UDAFs.
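For reference, table addressing by catalog.db.table looks roughly like the following sketch (the catalog, database, and table names here are hypothetical):

```sql
-- Hypothetical names: my_catalog, my_db, orders.
-- Fully qualified reference:
SELECT * FROM my_catalog.my_db.orders;

-- Or set the session's current catalog and database first:
USE CATALOG my_catalog;
USE my_db;
SELECT * FROM orders;
```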
In Flink, there are three catalog implementations:
● The first is GenericInMemoryCatalog, the in-memory version of the catalog. It is the default when using Flink SQL. Because it lives only in memory, a new catalog instance is regenerated each time the program runs.
● The second is HiveCatalog, which is well supported in Flink. It can read metadata from the Hive Metastore (HMS) and register tables whose data is written to Hive.
● The third is the catalog developed for the VVP platform, VvpCatalog. It implements Flink's Catalog interface and uses a database as its underlying storage.
A deployment is the description of a job. At present there are two task types, JAR and SQL.
A deployment has an upgrade strategy and a restore strategy. The upgrade strategy determines how modifying the deployment's parameters affects the running job after the deployment has started; the restore strategy determines whether to restore from a savepoint / checkpoint when the Flink task starts.
The Flink version, Flink configuration, and commonly used Flink parameters can also be configured here, for example the number of TaskManagers and the CPU and memory of the JobManager and TaskManagers.
In addition to the job description, a deployment also carries an expected state and an actual state. The expected state is the target state the user wants; for example, when a running job is to be stopped, the expected state is CANCELLED. The actual state is the state the job is actually in.
Generally speaking, a deployment is a job description template. A state machine inside the VVP platform drives the actual job according to the deployment's expected and actual states.
When a deployment is started, it generates a job that corresponds to a specific Flink job. A deployment has at most one running job at a time.
2、 Syntax description of SQL
1. Syntax description
First look at the statements in the following figure, which create a source table and a result table.
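Since the figure is not reproduced here, the two DDL statements might look roughly like this (column names and options are assumptions, following the datagen and blackhole connectors used later in the article):

```sql
-- Source table backed by the datagen connector (generates random rows).
CREATE TABLE datagen_source (
  id BIGINT,
  name VARCHAR
) WITH (
  'connector' = 'datagen'
);

-- Result table backed by the blackhole connector (discards all rows).
CREATE TABLE blackhole_sink (
  id BIGINT,
  name VARCHAR
) WITH (
  'connector' = 'blackhole'
);
```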
The following figure shows function registration. Registration takes two steps: the first is to upload the JAR package and check automatic registration in the system; the second is to register manually using Flink syntax.
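The manual path uses Flink's CREATE FUNCTION syntax. A minimal sketch, assuming a UDF class has already been uploaded in a JAR (the function name and class path below are assumptions for illustration):

```sql
-- Registers the Java class as a SQL function named MyScalarFunc.
CREATE FUNCTION MyScalarFunc AS 'com.example.MyScalarFunc';
```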
There are two ways to use functions. The first is built-in functions, as shown in the figure below; UPPER is one of Flink's own functions. The second is user-defined functions, such as MyScalarFunc.
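A sketch of both usages, assuming the datagen_source table and the MyScalarFunc UDF from the examples above:

```sql
-- Built-in function: UPPER ships with Flink.
SELECT UPPER(name) FROM datagen_source;

-- User-defined function, registered as shown earlier.
SELECT MyScalarFunc(id) FROM datagen_source;
```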
The VVP platform also supports Flink's temporary tables, which are valid only within the current session. In the following example, we create two temporary tables and output the data read from the datagen_source table to the blackhole_sink table.
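A minimal sketch of that example (the column definitions are assumptions):

```sql
-- Both tables exist only for the current session.
CREATE TEMPORARY TABLE datagen_source (
  id BIGINT
) WITH (
  'connector' = 'datagen'
);

CREATE TEMPORARY TABLE blackhole_sink (
  id BIGINT
) WITH (
  'connector' = 'blackhole'
);

INSERT INTO blackhole_sink SELECT id FROM datagen_source;
```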
The following figure is a syntax example of a temporary view. The first two statements create the same temporary tables as before; the third statement creates a tmp_view, which represents a query from datagen_source. In Flink, a temporary view can be understood as a way to make SQL easier to write; it does not persist any data, which differs from the concept of a view in a database. The fourth statement reads data from the view and writes it to the sink table.
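The four statements might be sketched as follows (the schemas are assumptions):

```sql
CREATE TEMPORARY TABLE datagen_source (
  id BIGINT
) WITH (
  'connector' = 'datagen'
);

CREATE TEMPORARY TABLE blackhole_sink (
  id BIGINT
) WITH (
  'connector' = 'blackhole'
);

-- The view is not persisted; it only names a query.
CREATE TEMPORARY VIEW tmp_view AS
SELECT id FROM datagen_source;

INSERT INTO blackhole_sink SELECT id FROM tmp_view;
```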
The following figure shows the syntax example of statement set. This syntax is not available in Flink version 1.11, but it is supported on the VVP platform.
As shown in the figure above, multiple INSERT INTO statements can be written between the BEGIN STATEMENT SET and END statements. The example reads the datagen_source table and writes to two sink tables at the same time. After this statement is submitted, a single complete Flink job with one source and two sinks is started.
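A sketch of such a statement set, with two assumed sink table names:

```sql
BEGIN STATEMENT SET;

-- Both INSERTs run inside one Flink job: one source, two sinks.
INSERT INTO blackhole_sink_a SELECT id FROM datagen_source;
INSERT INTO blackhole_sink_b SELECT id FROM datagen_source;

END;
```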
2. Application scope of SQL
Create table: the registered table is written to the system catalog, which on the VVP platform means it is persisted in VvpCatalog. This suits scenarios where multiple queries share the metadata.
Create temporary table: the table is written to the in-memory catalog and is not persisted, so it suits scenarios where the metadata does not need to be shared and is used only by the current query.
Create temporary view: mainly used to simplify SQL statements. Without it, SQL with complex logic becomes very hard to write and to read.
Statement set: suitable for scenarios that need to output to multiple downstream sinks.
3、 SQL practice
Next, let's walk through a sales statistics example. As shown in the figure below, the requirement is to count the transaction volume per hour.
We first create two tables, a source table and a result table. The following figure is the statement that creates the source table: the data comes from Kafka, and a watermark of 5 seconds is defined.
The following figure shows the result table, which is also a Kafka table.
The following figure is the query statement. After reading data from the source table, it aggregates the data with a TUMBLE window to compute the transaction volume per hour.
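Since the figures are not reproduced here, a hedged reconstruction of the three statements might look like this (the field names, topics, and Kafka options are all assumptions):

```sql
-- Source table: events from Kafka, with a 5-second watermark.
CREATE TABLE sales_source (
  item_id BIGINT,
  price DOUBLE,
  ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'sales-input',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);

-- Result table: hourly totals, also written to Kafka.
CREATE TABLE sales_sink (
  window_start TIMESTAMP(3),
  total_price DOUBLE
) WITH (
  'connector' = 'kafka',
  'topic' = 'sales-output',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);

-- Hourly transaction volume via a TUMBLE window.
INSERT INTO sales_sink
SELECT TUMBLE_START(ts, INTERVAL '1' HOUR) AS window_start,
       SUM(price) AS total_price
FROM sales_source
GROUP BY TUMBLE(ts, INTERVAL '1' HOUR);
```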
1. Practical demonstration
Open the VVP interface. There is an SQL editor on the left. There are three columns on the left of the editor. The first column is scripts, where SQL text is written; The second column is schema, which is used to define metadata; The third column is artifacts, which is used to register UDF.
First, define a datagen_source table. Click the validate button in the upper right corner, and after validation passes, click Run next to it. The running results appear below. After a successful run, click Schemas on the left to find the datagen_source table just created.
Then create a sink table. The connector type is blackhole. Then verify and run.
In this way, both tables have been registered in the catalog. The default catalog name is VVP and the database name is default.
Next, you can write SQL statements, for example an INSERT INTO statement. After writing it, click validate and run. When running the INSERT INTO statement, the system asks whether to create an SQL job; click OK, fill in a name, and the SQL job is created.
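The INSERT INTO statement at this step might look like the following (the columns are assumed to match the two tables created above):

```sql
-- Reads from the registered datagen_source table and writes to blackhole_sink.
INSERT INTO blackhole_sink SELECT id, name FROM datagen_source;
```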
Click to start this job. During the startup process, you can see a lot of information and configuration about this job on the page.
2. UDF actual display
After UDF development is complete, build it into a JAR package, click Artifacts on the left side of the SQL editor, and then click the "+" sign to upload the JAR package.
After uploading, fill in the JAR package name and other information and click OK. While the JAR is being uploaded, the VVP system parses it; after parsing, the system asks whether to register. Check the functions to be registered and click Create Function.
Then, as shown in the figure below, the upper part lists the registered functions and the lower part lists the available functions. You can choose to continue registering or close the window.
If you no longer need a function, find it on the left side of the page, click the icon at the far right, and select Drop Function. There are two ways to re-register it: first, through Manage Function; second, manually through Flink's function registration syntax.
Create an SQL job that uses the registered function.
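Such a job might simply apply the UDF between the two tables from the earlier demo (the table, column, and function names are assumptions carried over from the steps above):

```sql
-- Assumes datagen_source / blackhole_sink and the MyScalarFunc UDF
-- registered in the previous steps.
INSERT INTO blackhole_sink
SELECT MyScalarFunc(id), name FROM datagen_source;
```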
Expanding the drop-down on the create page reveals many advanced configurations; modify only the ones you need.
3. Use of temporary table
The examples above register tables in the catalog. If you don't want to register tables in the catalog every time, you can use temporary tables directly.
As shown in the following figure, the CREATE TABLE statements and the INSERT INTO statement are written together, so you can create a new SQL job directly without registering anything in the catalog in advance.
4. Use of temporary view
Copy the statements from the earlier temporary view example into the SQL editor of the VVP platform and click Run directly to create a job.
5. Use of statement set
Copy the statements from the earlier statement set example into the editor and click Run directly to create a job. After startup, the following figure shows the result: the task reads data from one source table and outputs it to two different sink tables.
6. Query practice
Copy and paste the statements that create the source table and result table, and the query, from the earlier SQL practice section into the SQL editor of the VVP platform and run them. As the figure below shows, the job reading from and writing to Kafka is running.
Author: Zhou Kaibo (baoniu), Alibaba technical expert
This article is the original content of Alibaba cloud and cannot be reproduced without permission